Git Log in HTML: A Harder Problem and A Safe Solution
source link: http://www.oilshell.org/blog/2017/09/29.html
This is an update to How to Quickly and Correctly Generate a Git Log in HTML.
I simplified the problem to make it easy to explain, but some reader comments indicate that I oversimplified it. So here I pose a harder problem that requires arbitrary text.
To be brief, the solution is like the one in the last post, except:
- Use git's %x00 to insert NUL bytes, because bash can't do this.
- For adversarial input, check the number of NUL bytes before escaping.
Read on for details and the justification.
Recap of the Argument
The previous article had multiple goals:
- To explain a practical trick for escaping HTML in shell. I used this in real life!
- To make a point about programming style: I compared the naive vs. the pedantic solution.
- To explore a design space for the Oil shell. In particular, how should we communicate structured data between processes?
Regarding point #2, some of the comments missed the forest for the trees, so let me repeat the argument here:
- There is a naive way to solve this problem (the quick-but-unsafe way). A reader argued for ignoring escaping until you need it, so this is not a strawman.
- There is a pedantic way (the correct-but-clunky way). This also isn't a strawman. Another reader said they implemented something similar in Haskell using libgit2, but later switched to shell because of dependency problems (which was the issue I predicted).
- I presented a middle-ground solution, which is both quick and correct (for text). I used the trick of surrounding user-supplied data with 0x01 and 0x02 bytes, and then a single re.sub() invocation in Python can escape those demarcated substrings.
- I think the middle-ground solution is the best one, but it also has drawbacks. How can we address these issues in the Oil shell?
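To make the recap concrete, here is a minimal sketch of the 0x01/0x02 trick. The function name escape_demarcated and the sample input are hypothetical, and the use of html.escape is an assumption about what "escape" means here; the script from the previous post may differ in its details.

```python
import html
import re

# Assumed convention from the recap: untrusted text sits between a 0x01 byte
# and a 0x02 byte. DOTALL lets a demarcated field span multiple lines.
_SEGMENT = re.compile(r'\x01(.*?)\x02', re.DOTALL)


def escape_demarcated(page):
    """HTML-escape only the demarcated text and drop the two markers."""
    return _SEGMENT.sub(lambda m: html.escape(m.group(1)), page)


raw = '<td>\x01Fix <b> & "quotes"\x02</td>'
print(escape_demarcated(raw))
# prints: <td>Fix &lt;b&gt; &amp; &quot;quotes&quot;</td>
```

The surrounding markup passes through untouched; only the text between the markers is escaped.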
A Harder Variant of the Problem
The harder problem includes more fields, including the full git commit description. Here's an excerpt of example output.
4c353d1 | Wed Sep 20 00:44:45 2017 -0700 | Andy Chu | Simplify, and add a git-specific solution.
a089617 | Tue Sep 19 11:23:33 2017 -0700 | Andy Chu | Modify example:
The Worst Part of Shell: Pushing the Problem Around
Although it's my fault for oversimplifying the problem, I think many of the alternative solutions given on Reddit, Lobsters, and Hacker News illustrate two of the worst parts of shell.
(1) The first problem is that a new, possibly buggy, solution is constructed for every text processing problem, depending on the format of the input data. For example, various solutions assumed:
- knowledge of particular fields, e.g. the git commit hash looks like [0-9a-f].
- that only one field contained spaces.
- that there are no newlines in any field.
It would be better to have a single solution that works for a wide range of problems.
(2) These kinds of assumptions may be valid in certain contexts, but the solutions don't check them. In other words, data validation and error checking are ignored.
If assumptions are violated, the script might keep chugging until it fails later, or it might succeed, in which case your end users are left to cope with it.
These are reasons that shell scripts have a reputation for being difficult and unreliable.
A Secure and Maintainable Solution
In section 4 of the last post, I admitted that I also pushed the problem around. Instead of avoiding the HTML characters < and >, we now avoid 0x01 and 0x02. Those bytes are unlikely, but they're trivial to insert in an adversarial context: git commit -m $'\x01'.
I'm still pushing the problem around, but I argue that there's a uniquely desirable place to push it to. Now I use a pair of NUL bytes for delimiters, so that:
- The only restriction we place on the untrusted input is that it can't contain NUL bytes (0x00). This means we handle all UTF-8 text.
The new solution is in demo-multiline.sh in the oilshell/blog-code repository.
You can run it like this:
    ~/git/blog-code/git-changelog$ ./demo-multiline.sh write-file
    Wrote git-log-multiline.html
Here is an excerpt:
      echo '<table>'

      # git inserts NUL bytes when you specify %x00
      local num_fields=2  # 2 fields contain arbitrary text
      local format='<tr>
      ...
      <td> %x00%an%x00 </td>
      ...
      <td>%x00%B%x00</td>
    </tr>
    '

      # Write raw data to a file
      local num_entries=5
      git log -n $num_entries --pretty="format:$format" > tmp.bin

      # Check the raw data for the right number of NULs
      expect-nul-count $((num_entries * num_fields * 2)) < tmp.bin

      # Escapes text between pairs of NULs, writing HTML to stdout.
      escape-segments < tmp.bin

      echo '</table>'
    }
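The excerpt calls expect-nul-count and escape-segments without showing them. The following is a rough Python sketch of what such helpers could look like, not the code from the repository: the Python function names, the regular expression, and the choice of html.escape are all assumptions made for illustration.

```python
import html
import re

# Assumption: text between a pair of NUL bytes is untrusted, and everything
# else is trusted markup. DOTALL lets a segment span multiple lines.
_NUL_PAIR = re.compile(rb'\x00(.*?)\x00', re.DOTALL)


def expect_nul_count(data, expected):
    """Raise if the raw bytes don't contain exactly `expected` NULs."""
    actual = data.count(b'\x00')
    if actual != expected:
        raise ValueError('Expected %d NULs, got %d' % (expected, actual))


def escape_segments(data):
    """HTML-escape each NUL-delimited segment and drop the delimiters."""
    def repl(match):
        text = match.group(1).decode('utf-8')
        return html.escape(text).encode('utf-8')
    return _NUL_PAIR.sub(repl, data)


raw = b'<td>\x00Andy <Chu>\x00</td><td>\x00a & b\nline 2\x00</td>'
expect_nul_count(raw, 1 * 2 * 2)  # 1 entry, 2 fields, 2 NULs per field
print(escape_segments(raw).decode('utf-8'))
```

The count check matters because a stray NUL inside a field would shift the pairing, so that trusted markup gets escaped and untrusted text passes through raw.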
For the case of generating release notes, I wouldn't bother checking for NUL bytes.
But if I don't trust my input, it's easy to check. I don't think git change descriptions can contain NUL characters, but there's no guarantee. C++ strings do allow internal NULs, and in an adversarial context, it's not wise to rely on implementation details that may change.
Note 1: Bash Strings Can't Contain NUL
Notice that we're now using a git feature (%x00) rather than a bash feature ($'\x00'). This is because bash is written in C and truncates the string at the first NUL:
    $ mystr=$'abc\x00def'
    $ echo ${#mystr}  # length should be 7, not 3
    3
Compare with Python:
    >>> mystr = 'abc\0def'
    >>> print(len(mystr))
    7
Note 2: Searching for NUL Using Point-Free Style
Even if bash supported internal NULs, you still couldn't grep for them like this:
    $ grep $'\x00' myfile  # matches the empty string on every line
That's because the kernel is also written in C, and the items in the argv array passed to main() are NUL-terminated strings.
So instead I convert the file to hex digits with od, then grep and count lines:
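    count-nul() {
      # -v keeps od from collapsing repeated output lines into '*', which
      # would undercount long runs of NULs.
      # -o puts every match on its own line.  (grep -o -c doesn't work.)
      od -v -A n -t x1 | grep -o '00' | wc -l
    }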
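The argv limitation is not specific to bash. As an illustration, and assuming CPython 3 behavior as I understand it, the subprocess module refuses to build an argument list containing a NUL at all. The function name try_nul_argument is hypothetical.

```python
import subprocess
import sys


def try_nul_argument():
    """Try to pass an argument with an embedded NUL to a child process."""
    try:
        # The child is a do-nothing Python process; only the last argument
        # matters here.
        subprocess.run([sys.executable, '-c', 'pass', 'abc\x00def'], check=True)
    except ValueError as e:
        return 'rejected: %s' % e
    return 'accepted'


print(try_nul_argument())
```

Python rejects the argument before the child starts, rather than silently truncating it the way bash does.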
Aside: this function is written in point-free style. That means it doesn't mention variables, constants, or any other kind of data. I wrote about this in Pipelines Support Vectorized, Point-Free, and Imperative Style.
Lessons for Oil
This solution isn't perfect, but that gives me things to think about for the Oil language:
(1) Oil should be able to store NUL bytes in strings. This is trivial because Oil is currently based on the Python VM.
(2) Oil should let you directly write loops that check for NUL. The od ... | grep ... | wc -l solution is clever, but I'd rather be straightforward and efficient.
Maybe something like:
    proc expect-nul-count -- num_expected {
      var n = 0  # integer variable
      while (true) {
        # (exiting the loop at EOF is omitted in this sketch)
        var s = read(stdin, 4096)
        set n += s.count('\0')  # .count() is borrowed from Python
      }
      test $n -eq $num_expected || die "Expected $num_expected NULs"
    }
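For comparison, the same loop can be written in Python today. This is only a sketch of the idea; the name count_nuls and the default chunk size are my own choices, not part of any existing tool.

```python
import io


def count_nuls(stream, chunk_size=4096):
    """Count NUL bytes in a binary stream, reading fixed-size chunks."""
    n = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means EOF
            break
        n += chunk.count(b'\x00')
    return n


print(count_nuls(io.BytesIO(b'a\x00b\x00\x00c')))  # prints 3
```

In a script, the stream would typically be sys.stdin.buffer, so the function could stand in for the od | grep | wc pipeline.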
(3) Left-to-right syntax would be nice. Instead of,
    expect-nul-count $n < tmp.bin
    escape-segments < tmp.bin
perhaps:
    open tmp.bin -- expect-nul-count $n
    open tmp.bin -- escape-segments
This can also be done with cat, but some people don't like the useless use of cat.
How Tools Should Integrate with Oil
I had originally thought that a broader ecosystem around the Oil shell would need its own implementation of tools like ls and ps. This is because I want to transfer structured data over pipes.
But after thoroughly analyzing this problem, I think that the existing convention of using NUL bytes suffices for nearly all shell problems. It's wise to avoid boiling the ocean.
Although bash doesn't properly support NUL bytes, their use is already an informal convention in shell:
- git log supports them with %x00.
- find in findutils supports them with -print0 and -printf "%s\0%P\0".
On the other side:
- xargs -0 splits its input on NUL bytes.
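A consumer of this convention needs very little code. Here is a minimal sketch, assuming each record is terminated (not merely separated) by a NUL, as with find -print0. The function name split_nul_records and the sample data are hypothetical.

```python
def split_nul_records(data):
    """Split NUL-terminated records, as produced by e.g. find -print0."""
    if not data:
        return []
    if not data.endswith(b'\x00'):
        raise ValueError('input is not NUL-terminated')
    # Drop the final terminator so split() doesn't yield a trailing empty item.
    return data[:-1].split(b'\x00')


# File names may contain spaces and newlines, but never NUL.
sample = b'plain.txt\x00with space.txt\x00with\nnewline.txt\x00'
for name in split_nul_records(sample):
    print(repr(name))
```

Spaces and newlines inside a record need no quoting, because the only byte with special meaning is the one the data can't contain.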
However, shell should also be able to handle binary data with arbitrary bytes. Possible solutions:
- Pass the path of a temp file that contains the data. The receiver calls open() and read().
- Use the simple length-prefixed netstring format by DJB.
- base64-encode the data.
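To show why the netstring option handles arbitrary bytes, here is a small sketch of an encoder and decoder for the "length:payload," layout. The function names are hypothetical, the decoder is not hardened (for example, it doesn't reject leading zeros), and the bytes formatting assumes Python 3.5 or later.

```python
def encode_netstring(payload):
    """Encode bytes as <decimal length>:<payload>,"""
    return b'%d:%s,' % (len(payload), payload)


def decode_netstring(data, pos=0):
    """Decode one netstring starting at pos; return (payload, next_pos)."""
    colon = data.index(b':', pos)
    digits = data[pos:colon]
    if not digits.isdigit():
        raise ValueError('bad length prefix')
    start = colon + 1
    end = start + int(digits)
    if data[end:end + 1] != b',':
        raise ValueError('missing trailing comma')
    return data[start:end], end + 1


print(encode_netstring(b'hello world!'))  # prints b'12:hello world!,'

# The payload may contain NULs, commas, and colons, since the length is known.
wire = encode_netstring(b'a\x00b,c')
print(decode_netstring(wire))
```

Because the receiver reads exactly the stated number of bytes, no byte value has to be reserved as a delimiter.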
I'm not making any strong recommendations for external tools right now, but feel free to leave comments if there's anything I'm missing.
Conclusion
In this post, I posed a more general HTML escaping problem, and gave a simple and secure solution. If you find a problem with it, leave a comment.
This problem was worth thinking about because it made me realize that very little in the shell ecosystem needs to change in order to support structured data over pipes.
Tools can provide JSON or CSV output if they want. Oil will be able to understand those formats. But the minimum set of mechanisms we need is:
- Tools that use printf-style format strings should support something like %x00. This should be an easy change to make.
- The Oil shell should be able to store NUL bytes in the middle of strings.
Except for aesthetics, that's it. Oil may provide optional metaprogramming libraries to do transformations like this:
    # The format string is lazily evaluated here, in the context of a
    # record.  We should also do auto-escaping, as in JavaScript
    # ES6 template strings.
    git-log-wrapper | format '<td>$hash</td> <td>$description</td>'
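As a rough illustration of the auto-escaping idea, here is how a format step might behave if it were written in Python with string.Template. This says nothing about how Oil would implement it; format_record and the sample record are hypothetical.

```python
import html
from string import Template


def format_record(template, record):
    """Substitute $name fields, HTML-escaping every value first."""
    escaped = {k: html.escape(v) for k, v in record.items()}
    return Template(template).substitute(escaped)


row = format_record(
    '<td>$hash</td> <td>$description</td>',
    {'hash': '4c353d1', 'description': 'Fix <b> & friends'},
)
print(row)
# prints: <td>4c353d1</td> <td>Fix &lt;b&gt; &amp; friends</td>
```

The template is trusted and the record values are not, so escaping happens at the boundary where the two are combined.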
But that's a topic for later.