How to Quickly and Correctly* Generate a Git Log in HTML
source link: http://www.oilshell.org/blog/2017/09/19.html
9/29 Update: I slightly modified the solution given here.
In this post, I explain a short solution to a small text processing problem. Then I use it to illustrate the strengths and drawbacks of the Unix style of programming.
The Problem
For the OSH 0.1 release, I generated this HTML changelog:
At first, I used a single command, which was roughly:
$ git log --pretty=format:"<tr> <td>%H</td> <td>%s</td> </tr>"
Notice that git log uses printf-style substitution, e.g. where %s is the commit subject and %H is the commit hash.
However, two commit descriptions happen to use the HTML metacharacters < and &. The format string above gives the wrong results:
<td>Implement <&, with test.</td>
<td>Alternative to using <& $fd</td>
Instead, they should be escaped like this:
<td>Implement &lt;&amp;, with test.</td>
<td>Alternative to using &lt;&amp; $fd</td>
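As a quick check of what the escaped cells should look like, here is a minimal Python 3 sketch that runs the two subjects above through the standard library's html.escape. The helper name to_cell is made up for this illustration; it is not part of the pipeline described in this post.

```python
import html

# The two commit subjects from the changelog that contain HTML metacharacters.
subjects = [
    "Implement <&, with test.",
    "Alternative to using <& $fd",
]


def to_cell(subject):
    """Wrap one escaped subject in a table cell (hypothetical helper)."""
    return "<td>" + html.escape(subject) + "</td>"


cells = [to_cell(s) for s in subjects]

if __name__ == "__main__":
    for cell in cells:
        print(cell)
```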
The Pedantic Solution
Some programmers might stop here and say, Let's switch to a real programming language. Do it the right way.
In other words: Use a git API to retrieve structured data from the repository, then call an HTML escaping function on each string.
That solution seems "right", but it's onerous. Finding and installing a proper git API feels like yak shaving for this small task, and creates a new problem: dependencies.
We can have this debate later, but I still want to use shell. I don't want to turn a single git log invocation into a big program.
On the other hand, I don't want to be sloppy about escaping. That's a dangerous habit to get into.
(Related: Master Foo and the Ten Thousand Lines.)
A Short and Efficient Solution
Here's how I solved the problem:
- Surround the % substitutions with the bytes 0x01 and 0x02. Bash supports C-style escapes within strings, e.g. $'\x01'. In this case we'll use: $'<td>\x01%s\x02</td>'
- Pipe to Python to HTML-escape everything inside pairs of 0x01 and 0x02 bytes.
In other words, it's of the form:
$ git log --pretty="format:..." | python -c 'print re.sub(...)'
Here's a simplified version of the code I used. It's also available in the oilshell/blog-code repository.
# Escape portions of standard input delimited by special bytes
escape-segments() {
  python -c '
import cgi, re, sys

print re.sub(
  r"\x01(.*)\x02",
  lambda match: cgi.escape(match.group(1)),
  sys.stdin.read())
'
}

# Write an HTML table to stdout
git-log-html() {
  echo '<table>'

  local format=$'
  <tr>
    <td> <a href="https://example.com/commit/%H">%h</a> </td>
    <td>\x01%s\x02</td>
  </tr>'
  git log -n 5 --pretty="format:$format" | escape-segments

  echo '</table>'
}
The second argument to Python's re.sub is a replacement function. It takes a match object and calls cgi.escape() on the first captured group.
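The code above is Python 2 (print statement, cgi.escape). For readers on Python 3, where cgi.escape was removed in 3.8, here is a minimal sketch of the same filter using html.escape. The names SEGMENT and escape_segments are chosen for this example. One difference to be aware of: html.escape also escapes quote characters by default, which cgi.escape did not.

```python
import html
import re
import sys

# Same pattern as the shell function: one greedy group between 0x01 and 0x02.
# Without re.DOTALL, "." does not match newlines, so a match stays on one line.
SEGMENT = re.compile(r"\x01(.*)\x02")


def escape_segments(text):
    """HTML-escape each delimited segment and drop the delimiter bytes."""
    return SEGMENT.sub(lambda match: html.escape(match.group(1)), text)


# Act as a filter only when input is piped in, not when run interactively.
if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.write(escape_segments(sys.stdin.read()))
```

Saved under a name of your choosing, this script can take the place of escape-segments at the end of the git log pipeline.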
This solution correctly escapes the shell operators I used in my description, e.g. Implement <& becomes Implement &lt;&amp;.
Is it Correct? Truly Adversarial Input
At least some of you are skeptical right now. Didn't I just push the problem around? I've prevented the obvious XSS attack:
git commit -m '<script>alert("hi")</script>'
But now I have a problem with 0x01 and 0x02, which can occur in git commit descriptions.
However, I tried to attack my code with
git commit -m $'\x02<script>alert("hi")</script>\x01'
but it doesn't work because I used a greedy regex match (.*) and not a non-greedy one (.*?). A related subtlety is that, in this particular case, %s yields a single line of text, which helps because (.*) doesn't match newline characters.
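To see why the greedy match matters, here is a small Python 3 experiment (a sketch, not a proof of safety). It builds the line that the format '<td>\x01%s\x02</td>' would produce for the attack subject, then applies both regex variants. It uses html.escape in place of cgi.escape, and the helper name escape_with is made up for this example.

```python
import html
import re


def escape_with(pattern, text):
    """Escape the first captured group of every match of `pattern`."""
    return re.sub(pattern, lambda m: html.escape(m.group(1)), text)


# The attack subject starts with 0x02 and ends with 0x01, so that the
# delimiters added by the format string pair up around nothing.
subject = '\x02<script>alert("hi")</script>\x01'
line = "<td>\x01" + subject + "\x02</td>"

greedy = escape_with(r"\x01(.*)\x02", line)  # spans first 0x01 to last 0x02
lazy = escape_with(r"\x01(.*?)\x02", line)   # stops at the first 0x02

if __name__ == "__main__":
    print("greedy leaves a script tag:", "<script>" in greedy)
    print("lazy leaves a script tag:  ", "<script>" in lazy)
```

With the non-greedy pattern, the two empty 0x01/0x02 pairs are consumed and the script tag passes through untouched. With the greedy pattern, the whole subject is escaped, although the attacker's stray delimiter bytes remain in the output.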
This feels too clever to call secure. If you can construct a git message that causes the alert to fire, leave a comment. You can test your exploit by generating an HTML page with git-changelog/demo.sh in the oilshell/blog-code repository.
A More Severe Escaping problem
A bug in my solution would allow an Oil committer to run arbitrary code in the context of oilshell.org, which may or may not seem like a big deal.
But there was a recent case where trusting data in a git repository had far worse consequences:
In short, many developers display the current git branch in their bash prompt (including me). An attacker could create a branch name that would cause git-prompt.sh to execute arbitrary code:
Here's a proof of concept:
(Note: I didn't try this code.)
Conclusion
I like the git log | python solution, at least for this particular task. It's short and efficient.
However, it illustrates the downsides of using strings for everything. Although I wasn't able to construct an exploit, I'd hesitate to use this technique in an adversarial context. The CVE is further evidence that shell is a dangerous language.
I aim to rectify this with the Oil language. You should be able to have your cake and eat it too. There shouldn't be a tradeoff between "quick, but unsafe" and "correct, but clunky"!
Here's an outline of a future post on this topic:
- Shell is dangerous because it fundamentally confuses code and data. I've given examples of this in previous posts.
- Shell should be a powerful enough language to express any kind of escaping/quoting: HTML, JSON, shell, etc. It can barely do that now.
- Shell needs a more powerful regex API, which would avoid piping to Python.
- How should tools like git log be designed? The mini % language is a common pattern, but it's clunky and potentially unsafe, as this example shows.
- Passing structured data like JSON over pipes could help, although it requires cooperation from both sides, and it can be verbose. This technique is occasionally used, but I wouldn't call it common.
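To make the last item concrete, here is a minimal Python 3 sketch of the consuming side of such a pipe. It assumes a hypothetical producer that writes one JSON object per line with the keys "hash" and "subject". Those field names, and the function rows_from_json_lines, are invented for this illustration.

```python
import html
import json


def rows_from_json_lines(lines):
    """Render one table row per JSON record; blank lines are skipped."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Field names "hash" and "subject" are assumptions for this sketch.
        rows.append(
            "<tr> <td>{}</td> <td>{}</td> </tr>".format(
                html.escape(record["hash"]), html.escape(record["subject"])
            )
        )
    return rows


# Stand-in for the output of the hypothetical producer.
sample = [json.dumps({"hash": "abc123", "subject": "Implement <&, with test."})]
rows = rows_from_json_lines(sample)

if __name__ == "__main__":
    print("\n".join(rows))
```

Because each field arrives as a separate JSON string, the consumer can escape it without delimiter bytes. The cost is the cooperation and verbosity mentioned above.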
Reminder
If you'd like to help me make a better Unix shell, please try the latest release on real shell scripts, and file bugs if it doesn't work. Thanks!
Appendix A: Other Unix Tools that Use Field Substitution
- find and stat both have % languages that print file system metadata.
$ find . -printf 'relative path: %P\n'
...
$ stat -c 'name: %n'
...
- curl's variant is more readable, with multi-character variable names:
$ curl -o /dev/null --write-out 'url: %{url_effective}' http://example.com
...
- bash itself uses strings like \h \W for $PS1, the prompt string.
- date uses strftime strings, but the values look safe to substitute without escaping.
- Similarly, /usr/bin/time --format and the TIMEFORMAT variable for bash's time builtin use printf-style formatting, but the values are usually numbers, which don't need escaping in any context I can think of.
Leave a comment if you know of other tools that follow this pattern.