17

Toby 'qubit' Cubitt - Lost or corrupted undo-tree history

 4 years ago
source link: http://www.dr-qubit.org/Lost_undo-tree_history.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Lost or corrupted undo-tree history

6 January 2020 Undo-tree is the most popular of the Emacs packages I wrote and maintain,11

not least because it's used by Evil and Spacemacs. However, for many years, the internet has been awash22

with reports of undo history being lost or corrupted when using undo-tree-mode. Trouble was, though I do all my text editing in Emacs, and always have global-undo-tree-mode enabled, I'd never encountered these issues myself. And no one had ever been able to send me a recipe for reliably reproducing them. Making it fiendishly difficult to even begin to investigate what might be going on.

This post is about some recent changes to undo-tree that might finally address these issues.

Lost undo history: the dreaded "no further undo information" message

Both Emacs' vanilla undo system and the undo-tree system print this message when there are no further changes to be undone in the buffer. Undo-tree users had been reporting for years that they were seeing this message, and weren't able to undo any further, even though they hadn't undone all the way back to the very first change they'd made to the buffer.

Finally, last year, user and Spacemacs contributor duianto very helpfully posted on the Spacemacs github issue tracker a simple recipe to reproduce this issue. Finally having a reliable way to reproduce this issue, it was then easy to investigate what was going on. And it turned out that …

undo-tree-mode was working completely correctly, and behaving exactly as documented.

When the size of the undo history recorded by Emacs grows larger than the limit set by the undo-limit configuration variable33

, Emacs discards the oldest changes recorded in the undo history, to bring the history size back below this limit. Undo-tree implements exactly the same behaviour, discarding the oldest changes when the size of the undo tree grows bigger than undo-limit. duianto's recipe generated enough changes for the undo history stored in the undo tree to grow larger than undo-limit. So undo-tree duly discarded the oldest changes.

One response to this would be to yell RTFM at everyone.

But I think a much more useful and instructive response is to consider why users thought this behaviour was a bug in undo-tree. Whereas (as far as I'm aware) no one reported this issue with vanilla Emacs undo, despite it discarding undo history in exactly the same way.

I can think of a few explanations:

  1. Undo-tree gave no indication to the user when undo history was being discarded. It just silently discarded the oldest history when it grew beyond undo-limit. So users weren't expecting undo history to have been discarded.

This alone doesn't explain it, as Emacs' vanilla undo system gives no such indication, either. But…

  1. Because of the way undo-tree-mode stores undo history as a tree of changes, it typically takes far fewer undo operations to before you run out of changes to undo. If you repeatedly undo, you run out of changes as soon as you reach the top of the tree. Other lines of undo history are stored in other branches of the tree, reached by switching to a different undo branch rather than repeatedly undoing.

    Whereas with vanilla Emacs' linear undo history, you have to wind all the way back through the entire undo history, undoing changes, then undoing previous undos, then undoing the undoing of the undos, and so on, before you reach the beginning of the undo history and run out of things to undo.44

  2. With vanilla Emacs undo, it's hard to know where you've got to in the undo history, or how far back you expect to reach. The very act of undoing creates more undo history. And as soon as you break a sequence of undos with any other command (even moving the point), you start right back from the beginning, undoing your recent undos (which is how vanilla Emacs implements redo operations).

    By making the undo history conceptually simpler to navigate, and even displaying the whole undo-tree visually, undo-tree-mode makes it much easier to know where you are in the undo history, where you want to get to, and therefore much more noticable if undo history has been discarded.

Viewed in this light, these non-bug reports are an indication that undo-tree is doing a good job of making Emacs undo more usable, but a bad job of meeting users' expectations. Apparently users expect undo history to be preserved indefinitely – not unreasonable given modern computing resources. So they perceive it to be a bug if any undo history is lost. And undo-tree makes it that much easier to notice that some undo history has been lost, compared to vanilla Emacs undo.

If software isn't matching user's expectations, there may not be a bug in the code. But in my view, there is still a bug: in the user experience. These possible explanations for the non-bug reports suggest two ways to address this in undo-tree: reduce user's expectations, or change undo-tree's behaviour to match user's expectations. In the latest undo-tree release, I've tried to do both.

Emacs famously solved the complaints about garbage-collection freezes due to its stop-the-world garbage collector by simply disabling the message telling users garbage collection was happening. The undo-tree history discarding issue seems to call for the opposite solution. Previously, undo-tree (like vanilla Emacs) didn't warn users when undo history was discarded due to breaching the undo-limit.55

In the new release, undo-tree displays a message in the echo area when undo history is discarded, and points users to the undo-limit configuration variable.

At the same time, users clearly expect more undo history to be preserved when using undo-tree-mode. The Emacs default for undo-limit is 80kB, which is very conservative. It's a very safe bet that any user running Emacs with undo-tree-mode these days has a lot more spare RAM than this. So the new version of undo-tree increases the default undo history size limit to 80MB when running undo-tree-mode. More precisely, there are three new configuration variables, undo-tree-limit, undo-tree-strong-limit and undo-tree-outer-limit, all with larger default values than their vanilla Emacs counterparts. undo-tree-mode sets the undo history size limit to whichever of undo-limit or undo-tree-limit is the larger (and similarly for the other configuration variables).

Finally, if you really don't want undo-tree to ever discard any undo history, no matter how large it grows, setting undo-tree-limit to nil will now make undo-tree preserve undo history indefinitely.66

I'm not sure this is a wise idea, particularly if coupled with the popular undo-tree-auto-save-history setting which persistently saves undo-tree history to file. Over time, undo history could easily grow large enough to become very slow to load and save, or even use. So this is not the default setting. But the option's there now for those who want to live on the edge.

Corrupted undo-tree history

The undo-history-discarding issue is only half the story, though. There has also been a steady trickle of reports of corrupted undo history, resulting in either errors or (worse still) garbled buffer text when undoing. Unfortunately, I still have yet to be sent a recipe for reliably reproducing this issue, and I can't reproduce it myself. So I'm still shooting in the dark on this one. However, I have a sneaking suspicion a race condition with the Emacs garbage collector might be responsible…

To understand why, you need to understand a little of how Emacs' undo system works behind the scenes, and how undo-tree manages to replace this at the Elisp level, without changing any Emacs c code. Whenever you make any modification to the text in a buffer, Emacs pushes a description of those changes onto the front of the list stored in the buffer-local buffer-undo-list variable. This variable gets special treatment during garbage collection. Whenever GC runs, it checks if the size of buffer-undo-list exceeds undo-limit (or one of the other undo size limits), and deletes enough older undo history from the tail of buffer-undo-list to bring it back within the configured size limits.

Much of this mechanism is hard-coded deep down in Emacs c code, and runs outside the context of the Elisp interpreter. Undo-tree stays well away from this mechanism, and doesn't mess with it at all. In undo-tree-mode, undo history still accumulates in buffer-undo-list. But whenever you call any undo-tree command (undo-tree-undo, undo-tree-redo, undo-tree-visualize etc.), the first thing it does is to transfer any undo history that has accumulated in buffer-undo-list since the last time, into the undo-tree package's own buffer-undo-tree data structure. In this way, undo-tree doesn't have to reimplement any of the code for recording undo history (which would anyway be very difficult to do from pure Elisp code). Instead, it can rely on Emacs' built-in, deeply embedded, thoroughly road-tested implementation of this.

However, this design means undo history continues to accumulate in buffer-undo-list until the next undo-tree command is called. If the user doesn't run any undo-tree commands for a long period, buffer-undo-list can easily grow larger than undo-limit and get garbage collected before any undo-tree command is called, i.e. before undo-tree even has a chance to see it.77

It's critical there are no gaps in the undo history; it must contain a continuous, unbroked chain of changes. Otherwise, undoing across a gap in the history would certainly lead to corrupted buffer text at best, errors at worse.

With older undo history still stored in buffer-undo-tree, somewhat more recent history deleted before undo-tree saw it, and the newest history still stored in buffer-undo-list, doesn't this mean undo-tree will inevitably end up corrupting the undo history? Is this the source of the bugs? No! It would be a very poor design if that were the case. The potential GC race condition is more subtle.

Let's think through the scenario where Emacs garbage collects undo history exceeding undo-limit before undo-tree gets to see it. Emacs only ever garbage collects just enough undo history to bring it back within the limit, in order to preserve as much undo history as possible. So the undo data remaining in buffer-undo-list after GC is as much as Emacs is allowed to store, according to the current undo-limit setting. Therefore, the correct thing for undo-tree to do in this scenario is to discard the contents of buffer-undo-tree entirely, and build a fresh undo tree from scrach using just the contents of buffer-undo-list. This freshly built tree will already contain as much undo history as undo-tree is allowed to store, according to the configured undo-limit.

All that's required is some mechanism to detect that Emacs has garbage collected undo history since undo-tree last saw the contents of buffer-undo-list. We could try to check the size of buffer-undo-list against undo-limit (and the other settings). But this would be very unreliable. Emacs only discards undo history in chunks ("undo changesets" in Emacs terminology); it doesn't discard individual changes. A very large changeset that just barely pushes buffer-undo-list over undo-limit would, when discarded, leave a buffer-undo-list that's much smaller than the limit. But the same undo history could equally well have been produced by a sequence of small changes that never pushed buffer-undo-list over undo-limit at all.

Instead, undo tree adds a special indicator entry to the very end of buffer-undo-list. When undo tree transfers undo history from buffer-undo-list to buffer-undo-tree, it checks whether this special indicator is still present or not.88

Since garbage collection deletes the oldest undo history from the end of buffer-undo-list first, the first thing that gets deleted is this special indicator entry. If it's missing, it means Emacs has discarded some undo history since undo-tree last saw it, and a fresh buffer-undo-tree should be built from the remaining contents of buffer-undo-list.

The undo-tree package's undo-list-transfer-to-tree function is the only interface between Emacs' low-level undo history recording mechanisms, and undo-tree's system. This is quite a simple function: all it does is pop entries off buffer-undo-list, and add them to buffer-undo-tree. It doesn't manipulate the undo data itself in any way.99

And when it comes to actually performing undo operations, undo-tree doesn't even do these itself! Rather, it passes this same undo data back to Emacs' own primitive-undo function – the very same function that Emacs' vanilla undo system uses. So there's very little scope for corrupting the undo data, and very little code footprint where this could occur. The simple undo-list-transfer-to-tree function is the only plausible candidate.

But Emacs' GC runs outside the Elisp interpreter. It can be triggered at any point whilst running Elisp code, between pretty much any pair of Elisp instructions. What if GC happens to run whilst undo-list-transfer-to-tree is in the middle of popping undo entries off buffer-undo-list? As long as GC deletes part of buffer-undo-list that comes after the entry undo-list-transfer-to-tree is currently processing, when it resumes after GC undo-list-transfer-to-tree will just notice the missing indicator when it gets to the end of buffer-undo-list, and build a fresh buffer-undo-tree from the undo history it's just processed. But if GC happens to delete part of the changeset undo-list-transfer-to-tree is in the middle of processing, it could end up transferring a partial undo changeset. This could potentially lead to corrupted undo history.

Worse still is if GC deletes part of buffer-undo-list starting from a point before the entry undo-list-transfer-to-tree has already reached. Emacs' mark-and-sweep GC shouldn't delete any elements reachable from Elisp, which in this scenario would include the whole of buffer-undo-list. But that variable is a glaring exception. Its contents are always reachable from Elisp, via buffer-undo-list itself. Yet Emacs needs to delete part of it during GC, nonetheless. So the Emacs garbage collector takes a blunt approach: cons cells get unlinked from buffer-undo-list by GC, regardless of whether something else is holding a reference to them.1010

In this scenario, when it resumes, undo-list-transfer-to-tree will continue processing cons cells that have been unlinked from buffer-undo-list and the behaviour is undefined, or at least undocumented.1111

Like many race conditions, this one is extremely hard to test. It requires GC to be triggered at just the right point. Moreover, GC runs outside the context of the Elisp interpreter. So from within Elisp or the Elisp debugger, you only get to see the state of buffer-undo-list and undo-list-transfer-to-tree before and after GC, but not during. However, when testing duianto's recipe for reproducing the undo-tree history discarding issue discussed above, I sporadically saw some odd behaviour that didn't make sense from Elisp, and which I couldn't reliably reproduce. And other comments on the same Spacemacs git issue tracker circumstantially hint at GC being implicated. In any case, whether it's responsible for the trickle of undo-tree corruption bug reports or not, this potential GC race condition is still something that needs addressing.

In the latest undo-tree release, I've implemented multiple measures to reduce the likelihood of – and eliminate the effects of – this race condition:

  1. undo-list-transfer-to-tree now manually triggers GC by calling the garbage-collect function, just before it starts processing buffer-undo-list.1212

    This should significantly reduce the likelihood of GC occurring again whilst in the middle of processing buffer-undo-list.

  2. Previously, undo-list-transfer-to-tree popped undo entries one by one from buffer-undo-list, and built new sections of the buffer-undo-tree data structure as it went along. This gave a relatively large time window when GC could be triggered whilst still in the middle of processing buffer-undo-list. In the new version, undo-list-transfer-to-tree first deep-copies the entire contents of buffer-undo-list to a new variable, then processes the copy. In this way, GC could only affect things if it occurs whilst copying, which should be a singificantly shorter time window.
  3. Finally, undo-tree now installs a function in the post-gc-hook which sets an undo-tree-gc-flag variable to t whenever GC runs. undo-list-transfer-to-tree resets this flag to nil just before deep-copying buffer-undo-list, and checks if it's still nil afterwards. If not, it keeps retrying, until it succeeds in copying buffer-undo-list without GC being triggered during copying.

The first two measures should significantly reduce the chance of hitting this GC race condition. The third should elliminate any possibility of bad side-effects, in the unlikely event GC does still run at a bad time.

As it's extremely difficult to reliably reproduce the race condition, let alone be sure if this is what's behind the unreproducible bug reports, it's very hard to test how successful these measures will be before the release. On the other hand, it's also possible that many or all of the bug reports related to the undo history discarding non-bug discussed above, which measures discussed there should alleviate.

The new release is now out on both GNU Elpa and here. Time will tell whether these measures stem the tide (or rather, divert the rivulet).


Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK