How to Quickly and Correctly* Generate a Git Log in HTML

2017-09-19

9/29 Update: I slightly modified the solution given here.

In this post, I explain a short solution to a small text processing problem. Then I use it to illustrate the strengths and drawbacks of the Unix style of programming.

The Problem

For the OSH 0.1 release, I generated this HTML changelog:

/release/0.1.0/changelog.html

At first, I used a single command, which was roughly:

$ git log --pretty=format:"<tr> <td>%H</td> <td>%s</td> </tr>"

Notice that git log uses printf-style substitution, where %s is the commit subject and %H is the commit hash.

However, two commit descriptions happen to use the HTML metacharacters < and &. The format string above gives the wrong results:

<td>Implement <&, with test.</td>
<td>Alternative to using <& $fd</td>

Instead, they should be escaped like this:

<td>Implement &lt;&amp;, with test.</td>
<td>Alternative to using &lt;&amp; $fd</td>
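
As a quick illustration of the escaping these two subjects need, here is a minimal Python 3 sketch. It uses the standard library's html.escape, and the subjects list simply holds the two strings quoted above.

```python
import html

# The two commit subjects quoted above.
subjects = [
    "Implement <&, with test.",
    "Alternative to using <& $fd",
]

# quote=False escapes only &, < and >, which is enough for text inside <td>.
cells = ["<td>%s</td>" % html.escape(s, quote=False) for s in subjects]

for cell in cells:
    print(cell)
```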

The Pedantic Solution

Some programmers might stop here and say, Let's switch to a real programming language. Do it the right way.

In other words: Use a git API to retrieve structured data from the repository, then call an HTML escaping function on each string.
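
To make that concrete, here is a rough Python 3 sketch of the escaping half of that approach. The function name render_rows and the sample data are hypothetical; a real program would get the (hash, subject) pairs from whichever git API it chose, which is exactly the part I'd rather not set up.

```python
import html


def render_rows(commits):
    """Render (hash, subject) pairs as HTML table rows, escaping each field."""
    rows = []
    for commit_hash, subject in commits:
        rows.append(
            "<tr> <td>%s</td> <td>%s</td> </tr>"
            % (html.escape(commit_hash), html.escape(subject))
        )
    return "\n".join(rows)


# Hypothetical data standing in for what a git API would return.
example_commits = [("abc123", "Implement <&, with test.")]
print(render_rows(example_commits))
```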

That solution seems "right", but it's onerous. Finding and installing a proper git API feels like yak shaving for this small task, and creates a new problem: dependencies.

We can have this debate later, but I still want to use shell. I don't want to turn a single git log invocation into a big program.

On the other hand, I don't want to be sloppy about escaping. That's a dangerous habit to get into.

A Short and Efficient Solution

Here's how I solved the problem:

First, surround the % substitutions with the bytes 0x01 and 0x02. Bash supports C-style escapes within strings, e.g. $'\x01'. In this case we'll use:

$'<td>\x01%s\x02</td>'

Then pipe to Python to HTML-escape everything inside pairs of 0x01 and 0x02 bytes.

In other words, it's of the form:

$ git log --pretty="format:..." | python -c 'print re.sub(...)'

Here's a simplified version of the code I used. It's also available in the oilshell/blog-code repository.

# Escape portions of standard input delimited by special bytes
escape-segments() {
  python -c '
import cgi, re, sys

print re.sub(
  r"\x01(.*)\x02", 
  lambda match: cgi.escape(match.group(1)),
  sys.stdin.read())
'
}

# Write an HTML table to stdout
git-log-html() {
  echo '<table>'

  local format=$'
  <tr>
    <td> <a href="https://example.com/commit/%H">%h</a> </td>
    <td>\x01%s\x02</td>
  </tr>'
  git log -n 5 --pretty="format:$format" | escape-segments

  echo '</table>'
}

The second argument to Python's re.sub is a replacement function. It takes a match object and calls cgi.escape() on the first captured group.

This solution correctly escapes the shell operators I used in my description, e.g. Implement <& becomes Implement &lt;&amp;.
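
The Python snippet above is written for Python 2 (print statement, cgi.escape). On Python 3.8 and later cgi.escape no longer exists, so here is a sketch of the same substitution for Python 3. It assumes html.escape with quote=False is an acceptable stand-in for cgi.escape, and the function name escape_segments is mine, mirroring the shell function.

```python
import html
import re


def escape_segments(text):
    """HTML-escape whatever lies between 0x01 and 0x02 on each line.

    The delimiter bytes sit outside the capture group, so they are dropped
    from the output.
    """
    return re.sub(
        r"\x01(.*)\x02",
        lambda match: html.escape(match.group(1), quote=False),
        text,
    )


# One line of git log output, as the format string above would produce it.
line = "    <td>\x01Implement <&, with test.\x02</td>"
print(escape_segments(line))
```

In a pipeline, the same function could be applied to sys.stdin.read() and the result written to sys.stdout.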

Is it Correct? Truly Adversarial Input

At least some of you are skeptical right now. Didn't I just push the problem around? I've prevented the obvious XSS attack:

git commit -m '<script>alert("hi")</script>'

But now I have a problem with 0x01 and 0x02, which can occur in git commit descriptions.

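However, I tried to attack my code with:

git commit -m $'\x02<script>alert("hi")</script>\x01'

but it doesn't work because I used a greedy regex match (.*) and not a non-greedy one (.*?). The greedy match runs from the first 0x01 on the line to the last 0x02, so the injected bytes land inside the captured group and the whole subject is still escaped. A related subtlety is that, in this particular case, %s yields a single line of text, which helps because (.*) doesn't match newline characters.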
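
Here is a small Python 3 sketch of why the greedy match matters. It builds the line that git log would emit for the adversarial subject and runs both patterns over it. The helper name escape_with is hypothetical, and html.escape stands in for cgi.escape.

```python
import html
import re


def escape_with(pattern, text):
    """Apply the escaping substitution using the given regex pattern."""
    return re.sub(pattern, lambda m: html.escape(m.group(1), quote=False), text)


# The format string's own delimiters wrap a subject that starts with 0x02
# and ends with 0x01.
subject = '\x02<script>alert("hi")</script>\x01'
line = "<td>\x01" + subject + "\x02</td>"

greedy = escape_with(r"\x01(.*)\x02", line)
non_greedy = escape_with(r"\x01(.*?)\x02", line)

print("<script>" in greedy)      # False: the whole subject was escaped
print("<script>" in non_greedy)  # True: the script tag passes through
```

With the non-greedy pattern, each 0x01/0x02 pair encloses an empty string, so the script tag in between is never seen by the escaping function.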

This feels too clever to call secure. If you can construct a git message that causes the alert to fire, leave a comment. You can test your exploit by generating an HTML page with git-changelog/demo.sh in the oilshell/blog-code repository.

A More Severe Escaping Problem

A bug in my solution would allow an Oil committer to run arbitrary code in the context of oilshell.org, which may or may not seem like a big deal.

But there was a recent case where trusting data in a git repository had far worse consequences:

CVE-2014-9938

In short, many developers display the current git branch in their bash prompt (including me). An attacker could create a branch name that would cause git-prompt.sh to execute arbitrary code.

Here's a proof of concept:

https://github.com/njhartwell/pw3nage

(Note: I didn't try this code.)

Conclusion

I like the git log | python solution, at least for this particular task. It's short and efficient.

However, it illustrates the downsides of using strings for everything. Although I wasn't able to construct an exploit, I'd hesitate to use this technique in an adversarial context. The CVE is further evidence that shell is a dangerous language.

I aim to rectify this with the Oil language. You should be able to have your cake and eat it too. There shouldn't be a tradeoff between "quick but unsafe" and "correct but clunky"!

Here's an outline of a future post on this topic:

Shell is dangerous because it fundamentally confuses code and data. I've given examples of this in previous posts:
Shell should be a powerful enough language to express any kind of escaping/quoting: HTML, JSON, shell, etc. It can barely do that now.
Shell needs a more powerful regex API, which would avoid piping to Python.
How should tools like git log be designed? The mini % language is a common pattern, but it's clunky and potentially unsafe, as this example shows.
Passing structured data like JSON over pipes could help, although it requires cooperation from both sides, and it can be verbose. This technique is occasionally used, but I wouldn't call it common.
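
To give the last point some shape, here is a minimal Python 3 sketch of the consuming side. It assumes a hypothetical producer that writes one JSON object per line with a "subject" field; neither that producer nor the field name comes from a real tool. Because the field boundaries are carried by the JSON encoding, no delimiter bytes are needed.

```python
import html
import json


def rows_from_json_lines(lines):
    """Turn JSON records, one per line, into escaped HTML table rows."""
    rows = []
    for line in lines:
        record = json.loads(line)
        subject = html.escape(record["subject"], quote=False)
        rows.append("<tr> <td>%s</td> </tr>" % subject)
    return rows


# Stand-in for the producer's output, using the adversarial subject from above.
producer_output = [
    json.dumps({"subject": '\x02<script>alert("hi")</script>\x01'}),
]
rows = rows_from_json_lines(producer_output)
print("<script>" in rows[0])
```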

Reminder

If you'd like to help me make a better Unix shell, please try the latest release on real shell scripts, and file bugs if it doesn't work. Thanks!

Appendix A: Other Unix Tools that Use Field Substitution

find and stat both have % languages that print file system metadata.

$ find . -printf 'relative path: %P\n'
...

$ stat -c 'name: %n'
...

curl's variant is more readable, with multi-character variable names:

$ curl -o /dev/null --write-out 'url: %{url_effective}' http://example.com
...

bash itself uses strings like \h \W for $PS1, the prompt string.
date uses strftime strings, but the values look safe to substitute without escaping.
Similarly, /usr/bin/time --format and the TIMEFORMAT variable for bash's time builtin use printf-style formatting, but the values are usually numbers, which don't need escaping in any context I can think of.

Leave a comment if you know of other tools that follow this pattern.
