4

Explaining The Jaccard Metric

 2 years ago
source link: https://rjlipton.wpcomstaging.com/2018/12/14/explaining-the-jaccard-metric/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Explaining The Jaccard Metric

December 14, 2018

Why is it a metric?

Paul Jaccard was a botanist who worked at ETH in Zurich during much of the first half of the 20th century. He created, or discovered, the similarity notion that became the Jaccard metric. Very neat having a metric named after you.

Today we discuss proofs and explanations that the metric is indeed a metric.

The Jaccard index {J(A,B)} is the ratio of {|A \cap B|} to {|A \cup B|}. The metric is

\displaystyle \text{[*]} \qquad  J_{\delta}(A,B) = 1 - J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}

provided {A} and {B} are not both empty sets. If {A} and {B} are both empty then {J(A,B)} by definition is {1} and so {J_\delta(A,B)} is {0}. Generally we can assume that all the sets are non-empty.

The key question is to show that this satisfies the triangle inequality. That is, we must show that

\displaystyle  \text{[**]}\qquad J_\delta(A,C) \le J_\delta(A,B) + J_\delta(B,C).

Many proofs of this are known, and it has been remarked that some are fairly complicated. Some are short, but continued recent interest seems to say they haven’t satisfied as explanations.

We think we can supply a quick explanation, if you are already familiar with the triangle inequality holding for Hamming distance on sets:

\displaystyle  |A \oplus C| \leq |A \oplus B| + |B \oplus C|,

where {A \oplus B = (A \cup B) \setminus (A \cap B)} is symmetric difference. By rewriting [*] as {J_\delta(A,B) = |A \oplus B|/|A \cup B|} we can see that [**] becomes similar but includes denominators. If [**] is false then

\displaystyle  \text{[***]}\qquad \frac{|A \oplus C|}{|A \cup C|} > \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

Now if we have {B \subseteq A \cup C} then replacing both right-hand denominators by {|A \cup C|} cannot make the right-hand side of [***] bigger. But then we have a common denominator, and we can see from Hamming distance that [**] must be true.

So suppose {B} includes {b} elements not in {A \cup C}. The left-hand side of [***] is at most {1}, so each right-hand fraction must be of the form {\frac{p}{q}} where $latex {p \frac{p – b}{q – b}}&fg=000000$, so removing those elements from {B} would also make the right-hand side of [***] smaller. That brings us back to the {B \subseteq A \cup C} case and the previous contradiction. So [**] must be true. That’s our proof and explanation in brief.

A Careful Take

We will do the above proof more slowly and carefully, to ensure it is really clear. A convention: to make the formulas a bit more readable we use {AB} to denote the intersection of {A} and {B}. So the Jaccard metric is now

\displaystyle  J_\delta(A,B) = 1 - \frac{|AB|}{|A \cup B|}.

First we explain the main ideas:

{\bullet } We will argue that the intermediate set {B} in the triangle inequality can be constrained. The set {B} can be a subset of {A \cup C}. The intuition is that any extra elements in {B} can only make the triangle inequality weaker.

{\bullet } We will replace the definition of the Jaccard metric by equivalent one. This new definition is much closer to a known metric.

{\bullet } We will reduce the triangle inequality finally to a known triangle inequality.

Proof:

Let’s assume that {A,B,C} are the sets, and we wish to prove the triangle inequality:

\displaystyle  1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

Claim: We can assume that {B \subseteq A \cup C}. Suppose there was an element {x} in {B} but not in {A \cup C}. Then removing {x} from {B} would only tighten the triangle inequality. That is the LHS

\displaystyle  1 - \frac{|A C|}{|A \cup C|}

stays the same and the RHS terms

\displaystyle  1-\frac{|A B|}{|A \cup B|} \text{ and } 1-\frac{|B C|}{|B \cup C|},

can only decrease.

Claim: We can re-write {J_\delta(X,Y)} as

\displaystyle  \frac{|X \oplus Y|}{|X \cup Y|}.

As noted above, {X \oplus Y} is the set of elements that are in {X} or {Y} but not both.

Claim: We note that after applying the last claim three times, [**] becomes:

\displaystyle  \frac{|A \oplus C|}{|A \cup C|} \le \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

Claim : Since {B \subseteq A \cup C} we can multiply by {|A \cup C|} and get that the triangle inequality is implied by

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|.

Note this uses that

\displaystyle  |A \oplus B| + |B \oplus C| \le |A \oplus B| \frac{|A \cup C|}{|A \cup B|} + |B \oplus C|\frac{|A \cup C|}{|B \cup C|},

which follows from

\displaystyle  1 \le \frac{|A \cup C|}{|A \cup B|} \text{ and } 1 \le \frac{|A \cup C|}{|B \cup C|}.

Claim : But the last step is that

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|

is just the triangle inequality for the Hamming distance. It uses that

\displaystyle  (x \oplus y) \le (x \oplus z) + (z \oplus y)

holds for any single bits {x,y,z}.

\Box

How We Found The Proof

If you’re curious how we found the above, we were trying to check a different kind of proof. We started with the statement of the triangle inequality:

\displaystyle  \text{[**]}\qquad 1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

This looks a bit scary, with its multiple ratios and addition and subtraction. But the following feature jumps out:

The right side depends on {B} but the left side does not.

This suggests the idea of asking what happens if we change {B} one element at a time? Can we “walk” it to an extreme point at which the truth of [**] is obvious? A promising start was that we could remove any elements from {B} that are not in {A \cup C}, as shown above. So we can assume {B \subseteq A \cup C}. Can we move {B} toward {A \cup C} or at least {A \oplus C} while preserving the implication of truth for [**]?

Seen in this light, our above proof’s replacing the denominators by {|A \cup C|} is an “illegal move”—not a change to {B}—so less interesting. But we happened to notice it worked. Let’s follow the train of thought from where we got that {B \subseteq A \cup C} can be assumed. Okay, what the next move to make with {B}? Look again at the key expression:

\displaystyle  1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

Can we simplify this in some way? The answer is yes. The structure of {1} minus an expression suggested that perhaps we could combine the {1} and the ratios. Indeed it is not too hard to note that

\displaystyle  1-\frac{|XY|}{|X \cup Y|} = \frac{|X \oplus Y|}{|X \cup Y|}.

Okay perhaps this is not obvious. It is not trivial, but it is a standard idea that {1 - p/q} looks like the complementation of the “probabilty” ratio {p/q}. Once you think of this the exact formula follows. Thus we can re-write the required triangle inequality now as

\displaystyle  \frac{|A \oplus C|}{|A \cup C|} \le \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

We are almost done. The ratios are annoying, so can we get rid of them? We have assumed that {B \subseteq A \cup C}. So it seems like a good idea to assume—for the moment—that {B} is actually equal to {A \cup C}. But then it is easy to see that {A \cup B = A \cup C} and {B \cup C = A \cup C}. So the above becomes after multiplying by {|A \cup C|},

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|.

This looks really nice, no more ratios. Wait what is this expression? The {\oplus} is the exclusive-or function and it is not hard to note that this is the classic Hamming distance. So this inequality is a fact. Recall the Hamming distance records the number of differences between two bit-vectors, but sets are really just bit vectors.

Are we there yet? Almost. We only need to argue that if {B} is less than {A \cup C} the inequality we need is actually stronger. So we are done.

Open Problems

Do you like our proof? Is it clear from the intro—or from the second section? Or are you still unsure why the Jaccard metric satisfies the triangle inequality? We may follow up with more about other proofs.

We don’t think the photo used by this online Alchetron bio of Paul Jaccard is the botanist. It looks too modern, for one. No other photo seems extant. Google Images guesses Alchetron’s image to be Adrian Herzog, but we think its highest Jaccard index is to Orepic user paul.jaccard, as shown at top. Can you solve the mystery of who?

[Switched to standard J-sub-delta notation to clarify and fix issues, fixed HTML conversion glitch with p-q fractions]

Like this:

Loading...

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK