Super-fast flat file parsing in C# and Java with a perfect hash function

3 years ago

source link: https://blog.vermorel.com/journal/2015/2/16/super-fast-flat-file-parsing-in-c-and-java-with-a-perfect-ha.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Super-fast flat file parsing in C# and Java with a perfect hash function

At Lokad, (almost) all we do is to crunch flat text files. It’s not that we haven’t tried anything else - we did - many times - and it went poorly. Flat files are ubiquitous, well understood, and they yield very good performance both of the write side and the read side when working under tight budgets.

Keep in mind that the files we crunch are frequently generated by our clients, so while ProtoBuf or Cap’n Proto are very cool, asking our clients to deliver such formats would be roughly equivalent asking them to reimplement their in-house Java ERP in Haskell. To preserve the sanity of our clients, we keep it simple and we stick to flat files.

However we have decided to make flat file read fast, really fast. Thus, one of us decided to tackle the challenge dead-on, and came up with a very nice pattern: file parsing starts with a Perfect Hash Function preprocessing. Simply put, the flat file gets tokenized, and then each token gets replaced by an integer uniquely identifying this piece of string. Not only this saves a tremendous amount of string object instantiation, but afterward, all the complex parsing operations, such as parsing a date, can be performed only once, even if the token is encountered hundreds of times in the file. Performance-wise, it works because flat files tend to be very denormalized and very redundant.

We have released a tiny open source package codenamed Lokad.FlatFiles for C#/.NET (and a Java version too) under the MIT license. This library takes care of generating the perfect hashes out of a flat file. Our (unfair) benchmarks indicate that we typically reach about 30MB/second on a single CPU. Then, when the subsequent parsing operations take advantage of the token hashing, the speed-up is so massive that this initial perfect hashing tend to completly dominate the total CPU cost - so we stay at roughly 30MB/second.

Recommend

www.wondercms.com 6 years ago
Cache

WonderCMS - one of the smallest flat file CMS - What's new

Version 3.4.1 released January 22, 2023 This update brings 3 bug fixes, described below. We've also updated the Summernote plugin. ❤️ Support our free and open source miss...

Github github.com 3 years ago
Cache

A lightweight dex file parsing library

Dexbox A lightweight dex file parsing library Introduction ( 中文 ) Dexbox is a lightweight dex file pa...

www.youtube.com 2 years ago
Cache

Why 4:54 is the perfect speedrun - Super Mario Bros. World Record Speedrun Expla...

Why 4:54 is the perfect speedrun - Super Mario Bros. World Record Speedrun Explained16,756 views•Premiered 6 hours ago ...

Github github.com 2 years ago
Cache

GitHub - nolanlawson/optimize-js: Optimize a JS file for faster parsing (UNMAINT...

optimize-js Optimize a JavaScript file for faster initial execution and parsing, by wrapping all immediately-invoked functions or likely-to-be-invoked functions in parentheses.