
Alternative Java regex for current bytes

 3 years ago
source link: https://www.codesd.com/item/alternative-java-regex-for-current-bytes.html


I have XML files (encoded in UTF-8) that have two issues:

  • Some of them (not all) contain a byte order mark (EF BB BF).

  • Some of them (not all) contain null characters (00), distributed over the whole file.

Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use a regex to strip these characters, and write the String back to a file, which works fine. However, my files are quite large (hundreds of megabytes), and reading the whole file into a String and then creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I will need to stream the file and remove all these characters on the fly.

Any suggestions on what an efficient solution should look like?


I would subclass FilterInputStream to filter out the undesired bytes as the file is read.

The task should be rather easy: a byte order mark should only appear at the very start of the file (so you only need to check there), and NUL bytes can easily be filtered out with a simple == comparison (no need for regex-like features).

This will most likely also increase performance as you don't need to write out the full corrected file to disk before re-reading it.
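
A minimal sketch of such a filter might look like the following. The class name BomAndNulFilterInputStream is made up for illustration, and wrapping the source in a PushbackInputStream is just one possible way to peek at the first three bytes. It assumes the BOM, if present, is exactly the first three bytes of the stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical class name; drops a leading UTF-8 BOM (EF BB BF) and every 0x00 byte.
class BomAndNulFilterInputStream extends FilterInputStream {
    private boolean bomChecked = false;

    BomAndNulFilterInputStream(InputStream source) {
        // Pushback capacity of 3 lets us return a non-BOM prefix to the stream.
        super(new PushbackInputStream(source, 3));
    }

    // Runs once, before the first read: consume the first three bytes only if they are a BOM.
    private void skipBomOnce() throws IOException {
        if (bomChecked) {
            return;
        }
        bomChecked = true;
        PushbackInputStream pb = (PushbackInputStream) in;
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream is shorter than three bytes
            }
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: put the bytes back
        }
    }

    @Override
    public int read() throws IOException {
        skipBomOnce();
        int b;
        do {
            b = in.read();
        } while (b == 0); // skip NUL bytes; -1 (end of stream) ends the loop
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        skipBomOnce();
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream
            }
            // Compact the chunk in place, keeping only non-NUL bytes.
            int kept = 0;
            for (int i = off; i < off + n; i++) {
                if (buf[i] != 0) {
                    buf[off + kept++] = buf[i];
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was NUL bytes: read again rather than return 0.
        }
    }

    @Override
    public long skip(long n) throws IOException {
        // Count only bytes that survive filtering.
        long skipped = 0;
        while (skipped < n && read() != -1) {
            skipped++;
        }
        return skipped;
    }
}
```

Wrapped around a FileInputStream (ideally with a BufferedInputStream in between), a stream like this could be passed directly to SAXParser.parse(InputStream, DefaultHandler), so memory use stays bounded by the read buffer rather than the file size. The sketch does not handle a BOM that is itself interleaved with NUL bytes, and available() may overestimate because it still counts bytes that will be dropped.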

