# Explaining The Jaccard Metric

source link:https://rjlipton.wpcomstaging.com/2018/12/14/explaining-the-jaccard-metric/

Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

# Explaining The Jaccard Metric

*Why is it a metric?*

Paul Jaccard was a botanist who worked at ETH in Zurich during much of the first half of the 20th century. He created, or discovered, the similarity notion that became the Jaccard metric. Very neat having a metric named after you.

Today we discuss proofs and explanations that the metric is indeed a metric.

The Jaccard *index* is the ratio of to . The metric is

provided and are not both empty sets. If and are both empty then by definition is and so is . Generally we can assume that all the sets are non-empty.

The key question is to show that this satisfies the **triangle inequality**. That is, we must show that

Many proofs of this are known, and it has been remarked that some are fairly complicated. Some are short, but continued recent interest seems to say they haven’t satisfied as *explanations*.

We think we can supply a quick explanation, if you are already familiar with the triangle inequality holding for Hamming distance on sets:

where is symmetric difference. By rewriting [*] as we can see that [**] becomes similar but includes denominators. If [**] is *false* then

Now if we have then replacing both right-hand denominators by cannot make the right-hand side of [***] bigger. But then we have a common denominator, and we can see from Hamming distance that [**] must be true.

So suppose includes elements not in . The left-hand side of [***] is at most , so each right-hand fraction must be of the form where $latex {p \frac{p – b}{q – b}}&fg=000000$, so removing those elements from would also make the right-hand side of [***] smaller. That brings us back to the case and the previous contradiction. So [**] must be true. That’s our proof and explanation in brief.

## A Careful Take

We will do the above proof more slowly and carefully, to ensure it is really clear. A convention: to make the formulas a bit more readable we use to denote the intersection of and . So the Jaccard metric is now

First we explain the main ideas:

We will argue that the intermediate set in the triangle inequality can be constrained. The set can be a subset of . The intuition is that any extra elements in can only make the triangle inequality weaker.

We will replace the definition of the Jaccard metric by equivalent one. This new definition is much closer to a known metric.

We will reduce the triangle inequality finally to a known triangle inequality.

*Proof:*

Let’s assume that are the sets, and we wish to prove the triangle inequality:

**Claim**: We can assume that . Suppose there was an element in but not in . Then removing from would only tighten the triangle inequality. That is the LHS

stays the same and the RHS terms

can only decrease.

**Claim**: We can re-write as

As noted above, is the set of elements that are in or but not both.

**Claim**: We note that after applying the last claim three times, [**] becomes:

**Claim** : Since we can multiply by and get that the triangle inequality is implied by

Note this uses that

which follows from

**Claim** : But the last step is that

is just the triangle inequality for the Hamming distance. It uses that

holds for any single bits .

## How We Found The Proof

If you’re curious how we found the above, we were trying to check a different kind of proof. We started with the statement of the triangle inequality:

This looks a bit scary, with its multiple ratios and addition and subtraction. But the following feature jumps out:

The right side depends on but the left side does not.

This suggests the idea of asking what happens if we change one element at a time? Can we “walk” it to an extreme point at which the truth of [**] is obvious? A promising start was that we could remove any elements from that are not in , as shown above. So we can assume . Can we move toward or at least while preserving the implication of truth for [**]?

Seen in this light, our above proof’s replacing the denominators by is an “illegal move”—not a change to —so less interesting. But we happened to notice it worked. Let’s follow the train of thought from where we got that can be assumed. Okay, what the next move to make with ? Look again at the key expression:

Can we simplify this in some way? The answer is yes. The structure of minus an expression suggested that perhaps we could combine the and the ratios. Indeed it is not too hard to note that

Okay perhaps this is not obvious. It is not trivial, but it is a standard idea that looks like the complementation of the “probabilty” ratio . Once you think of this the exact formula follows. Thus we can re-write the required triangle inequality now as

We are almost done. The ratios are annoying, so can we get rid of them? We have assumed that . So it seems like a good idea to assume—for the moment—that is actually equal to . But then it is easy to see that and . So the above becomes after multiplying by ,

This looks really nice, no more ratios. Wait what is this expression? The is the exclusive-or function and it is not hard to note that this is the classic Hamming distance. So this inequality is a fact. Recall the Hamming distance records the number of differences between two bit-vectors, but sets are really just bit vectors.

Are we there yet? Almost. We only need to argue that if is less than the inequality we need is actually stronger. So we are done.

## Open Problems

Do you like our proof? Is it clear from the intro—or from the second section? Or are you still unsure why the Jaccard metric satisfies the triangle inequality? We may follow up with more about other proofs.

We don’t think the photo used by this online Alchetron bio of Paul Jaccard is the botanist. It looks too modern, for one. No other photo seems extant. Google Images guesses Alchetron’s image to be Adrian Herzog, but we think its highest Jaccard index is to Orepic user paul.jaccard, as shown at top. Can you solve the mystery of who?

[Switched to standard J-sub-delta notation to clarify and fix issues, fixed HTML conversion glitch with p-q fractions]

### Like this:

## Recommend

## About Joyk

Aggregate valuable and interesting links.

Joyk means Joy of geeK