Levenshtein Distance Between Cocktails

Cocktail similarity I was recently figuring out a minimum-viable bar setup for making cocktails at home, and a system for memorizing & recording recipes. When I started writing down the first basic ingredients, I started noticing that cocktails are very close to each other - if you ignore fruit rinds and ice and such, an Americano is a Negroni with soda water instead of gin. An Old Fashioned is a Manhattan with sugar instead of vermouth. So I wondered - what’s a cocktail edit distance? In the distant past I implemented Levenshtein distance for ‘biological computation,’ a 400-or-something level class that I kind of got through. Edit distances for DNA. Cocktail edit distance is a bit different - primarily because order doesn’t matter. It’s just set difference, or set distance. I haven’t really found much that explains this, probably because I don’t know which words to Google. So I initially dove into creating a graph visualization, but backtracked to an understandable-English first approach.

viewof route = form(html`<form style='padding-top:30px;'>
  <div><label><i>from</i> <select name="a">
${drinks.map((drink, i) => html`<option value=${i}>${drink.name}</option>`)}
</select></label>
  <label><i>to</i> <select name="b">
${drinks.map((drink, i) => html`<option value=${i}>${drink.name}</option>`)}
</select></label></div>
</form>`)

A is a but you .

Cocktail similarity graph Below is a non-directed graph of cocktails with their nearest neighbors. How to read this potentially zany graph: ingredient/ingredient: replacement, +ingredient: addition, -ingredient: removal. Also, you can grab the dots and move them.

settings = {
  const container = html`<div style='height:1000px;'></div>`;
  yield container;
  cy({
    width: width,
    container: container, // container to render in
    minZoom: 0.4,
    maxZoom: 2,

    elements: drinks
      .map(d => ({
        data: { id: d.id, label: d.name }
      }))
      .concat(firstLinks),

    layout: {
      name: "breadthfirst",
      minDist: 40
    },

    style: [
      // the stylesheet for the graph
      {
        selector: "node",
        style: {
          "background-color": "#aaa",
          label: "data(label)",
          "font-size": 20,
          width: 10,
          height: 10
        }
      },
      {
        selector: "edge",
        style: {
          label: "data(label)",
          "font-size": 14,
          width: 3,
          "edge-text-rotation": "autorotate",
          "line-color": "#ddd"
        }
      }
    ]
  });
}

The parts Data: a list of drinks. Mostly pulled from this IBA ‘Unforgettables’ list. Want to add something? Send me a suggestion! Thanks Kerry for the additional recipes!

drinks = [
  {
    name: "Negroni",
    ingredients: ["gin", "sweet vermouth", "campari"]
  },
  {
    name: "Jasmine",
    ingredients: ["gin", "cointreau", "campari", "lemon juice"]
  },
  {
    name: "Final Ward",
    ingredients: ["rye", "green chartreuse", "maraschino", "lemon juice"]
  },
    {
    name: "Mezcal Last Word",
    ingredients: ["mezcal", "green chartreuse", "maraschino", "lemon juice"]
  },
  
  {
    name: "Amber Road",
    ingredients: ["bourbon", "aperol", "lemon juice", "maple syrup"]
  },
  
    {
    name: "Fallback",
    ingredients: ["rye", "sweet vermouth", "apple brandy", "amaro nonino"]
  },
  {
    name: "Jack Rose",
    ingredients: ["apple brandy", "lemon juice", "grenadine"]
  },
  {
    name: "Gin Gimlet",
    ingredients: ["gin", "lime juice", "simple syrup"]
  },
  {
    name: "Vodka Gimlet",
    ingredients: ["vodka", "lime juice", "simple syrup"]
  },
     {
    name: "Sazerac",
    ingredients: ["rye", "cognac", "demerara", "absinthe"]
  },
   {
    name: "Contessa",
    ingredients: ["gin", "dry vermouth", "aperol"]
  },
    {
    name: "Hanky Panky",
    ingredients: ["gin", "sweet vermouth", "fernet-branca"]
  },
      {
    name: "Preakness",
    ingredients: ["rye", "sweet vermouth", "benedictine", "angostura"]
  },
  {
    name: "Whiskey Sour",
    ingredients: ["rye", "lemon juice", "simple syrup"]
  },
  {
    name: "Lucky Negroni",
    ingredients: ["gin", "salers", "green chartreuse"]
  },
  {
    name: "Key Party",
    ingredients: ["gin", "sweet vermouth", "amaro nardini", "green chartreuse"]
  },
   {
    name: "The Last Word",
    ingredients: ["gin", "lime", "maraschino", "green chartreuse"]
  },
  {
    name: "The Bad Word",
    ingredients: ["gin", "lime", "gran classico", "green chartreuse"]
  },
  {
    name: "fernetaboutit",
    ingredients: ["fernet", "lime", "maraschino", "green chartreuse"]
  },
  {
    name: "Art of Choke",
    ingredients: ["cynar", "rum", "lime", "demerara", "green chartreuse"]
  },
  {
    name: "Honey and Hearth",
    ingredients: ["bourbon", "lemon juice", "yellow chartreuse", "ginger liqueur"]
  },
  {
    name: "Boulevardier",
    ingredients: ["rye", "sweet vermouth", "campari"]
  },
    {
    name: "Third Man",
    ingredients: ["bourbon", "grapefruit", "campari", "lemon", "simple syrup"]
  },
    {
    name: "Greenpoint",
    ingredients: ["rye", "yellow chartreuse", "punt e mes", "angostura"]
  },
     {
    name: "Toronto",
    ingredients: ["whisky", "fernet", "simple syrup"]
  },
    {
    name: "Little Italy",
    ingredients: ["rye", "sweet vermouth", "cynar"]
  },
  {
    name: "Old Pal",
    ingredients: ["rye", "dry vermouth", "campari"]
  },
  {
    name: "Americano",
    ingredients: ["soda water", "sweet vermouth", "campari"]
  },
  {
    name: "Manhattan",
    ingredients: ["whiskey", "sweet vermouth", "bitters"]
  },
  {
    name: "Bourbon Manhattan",
    ingredients: ["bourbon", "sweet vermouth", "bitters"]
  },
  {
    name: "Old Fashioned",
    ingredients: ["bourbon", "simple syrup", "bitters"]
  },
  {
    name: "Martini",
    ingredients: ["gin", "dry vermouth"]
  },
  {
    name: "Vodka Martini",
    ingredients: ["vodka", "dry vermouth"]
  },
  {
    name: "White Lady",
    ingredients: ["gin", "triple sec", "lemon juice"]
  },
  {
    name: "Screwdriver",
    ingredients: ["vodka", "orange juice"]
  },
  {
    name: "Aviation",
    ingredients: ["gin", "maraschino", "lemon juice"]
  }
].map((drink, id) => ({ id, ...drink }))

A ‘greedy’ algorithm that links up this graph: each cocktail gets paired with its closest relative.

firstLinks = {
  let connected = new Set();
  let links = [];
  for (let i = 0; i < drinks.length; i++) {
    let startPoint = drinks[i];
    let closestDistance = Infinity;
    let closestDrink = undefined;
    let closestDiff = undefined;
    for (let drink of drinks) {
      // If we're comparing a drink to itself, or if we've already
      // linked these two, move along.
      if (drink === startPoint || connected.has(linkId(startPoint, drink))) {
        continue;
      }
      const diff = difference(drink, startPoint);
      if (diff.length < closestDistance) {
        closestDistance = diff.length;
        closestDiff = diff;
        closestDrink = drink;
      }
    }
    // Keep track of the ones we've linked.
    connected.add(linkId(startPoint, closestDrink));
    links.push({
      data: {
        source: startPoint.id,
        target: closestDrink.id,
        label: formatDifferenceShort(closestDiff)
      }
    });
  }
  return links;
}

An algorithm that figures out how to turn one cocktail into another in the fewest steps possible.

difference = (a, b) => {
  let ai = new Set(a.ingredients);
  let bi = new Set(b.ingredients);

  let removedItems = [];
  let addedItems = [];
  for (let item of ai) if (!bi.has(item)) addedItems.push(item);
  for (let item of bi) if (!ai.has(item)) removedItems.push(item);

  let changes = [];
  const replacementCount = Math.min(addedItems.length, removedItems.length);
  for (let i = 0; i < replacementCount; i++) {
    const removed = removedItems.pop();
    const added = addedItems.pop();
    changes.push({ type: "replacement", removed, added });
  }

  for (let added of addedItems) {
    changes.push({ type: "addition", added });
  }

  for (let removed of removedItems) {
    changes.push({ type: "removal", removed });
  }

  return changes;
}

html`<div style='padding:10px;background:#eef'>Incredible analysis below by <a href='https://beta.observablehq.com/@bryangingechen'>Bryan Gin-ge Chen</a>!</div>`

Distances between cocktails from distances between sets The size of the symmetric difference is one possible analogue of the edit distance to sets. The symmetric difference of two sets A and B is: We can define the symmetric difference distance as . It is what is called a "metric" on the collection of all finite sets. This means that it satisfies the following properties: (symmetry), , with equality if and only if A=B (distances are nonnegative, zero distance means the sets are identical), and (the triangle inequality). The first two are pretty natural properties for anything you might want to call a distance function. The third one is very natural as well, but it may take some thought if you haven't seen it before. In words, it says that the distance between two points A and C is never greater than the sum of the distances between A+B and B+C; it's something like "there are no shortcuts".

function symm_diff(a,b) {
  const ai = new Set(a.ingredients);
  const bi = new Set(b.ingredients);
  const a_b = new Set(a.ingredients.filter(d => !bi.has(d)));
  const b_a = new Set(b.ingredients.filter(d => !ai.has(d)));
  const union = new Set([...a_b].concat([...b_a]));
  
  return union.size;
}

Here's the matrix of symmetric difference distances between all pairs of cocktails. We'll come back to how to visualize this in a bit:

symm_mat = drinks.map((d) => drinks.map(e => symm_diff(d,e)))

An alternative measure arises from the Jaccard index, which is defined as the size of the intersection divided by the size of the union: The Jaccard distance is defined as . It is also a distance, though the triangle inequality is not obvious, see this MO thread for various proofs. Furthermore, the Jaccard distance between any pair of finite sets is at most 1. As you might expect, two sets are at the maximal distance 1 from each other if and only if they are disjoint.

function jaccard_distance(a,b) {
  const bi = new Set(b.ingredients);
  const inter = new Set(a.ingredients.filter(d => bi.has(d)));
  const union = new Set(a.ingredients.concat(b.ingredients));
  
  return 1 - inter.size / union.size;
}

Here's its distance matrix:

jacc_mat = drinks.map((d) => drinks.map(e => jaccard_distance(d,e)))

Well, now that we have these distance matrices, how do we make sense of them? This turns out to be the problem of multi-dimensional scaling (MDS). There's a neat blog post by Ben Frederickson introducing the topic, which also has some code for "classical multidimensional scaling" that I copied here:

// https://github.com/benfred/mds.js?files=1
// Copyright (C) 2013 Ben Frederickson

// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

// The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
/// given a matrix of distances between some points, returns the
/// point coordinates that best approximate the distances using
/// classic multidimensional scaling
function mds(distances, dimensions) {
  dimensions = dimensions || 2;

  // square distances
  var M = numeric.mul(-0.5, numeric.pow(distances, 2));

  // double centre the rows/columns
  function mean(A) { return numeric.div(numeric.add.apply(null, A), A.length); }
  var rowMeans = mean(M),
      colMeans = mean(numeric.transpose(M)),
      totalMean = mean(rowMeans);

  for (var i = 0; i < M.length; ++i) {
    for (var j =0; j < M[0].length; ++j) {
      M[i][j] += totalMean - rowMeans[i] - colMeans[j];
    }
  }

  // take the SVD of the double centred matrix, and return the
  // points from it
  var ret = numeric.svd(M),
      eigenValues = numeric.sqrt(ret.S);
  return ret.U.map(function(row) {
    return numeric.mul(row, eigenValues).splice(0, dimensions);
  });
};

Basically what this does is takes a n by n distance matrix distances as well as an integer dimensions (defaults to 2) and attempts to generate an array of n points in dimensions-dimensions such that the Euclidean distances between those points approximate the original distances. This is useful for us because we can look at use MDS to assign points in 2D space to each of the cocktails and then make a picture! Here they are:

symm_mds = mds(symm_mat, n)

jacc_mds = mds(jacc_mat, n)

Let's check how close the resulting distances between these MDS-embedded points in the plane are to the original distances. The entries of the following matrices are the relative errors :

symm_mds_dist = symm_mds.map((p1,i) => symm_mds.map((p2,j) => {
  const dist = Math.sqrt(p1.reduce((a, v, k) => {
    return a + (p2[k]-v)*(p2[k]-v);
  }, 0));
  return Math.abs(dist - symm_mat[i][j])/symm_mat[i][j];
}))

jacc_mds_dist = jacc_mds.map((p1,i) => jacc_mds.map((p2,j) => {
  const dist = Math.sqrt(p1.reduce((a, v, k) => {
    return a + (p2[k]-v)*(p2[k]-v);
  }, 0));
  return Math.abs(dist-jacc_mat[i][j])/jacc_mat[i][j];
}))

Not too great, but hey, what else are we going to do? Try playing around with the slider below which increases the dimension of the space that the points are embedded in. Current dimension:

viewof n = html`<input type="range" min="2" max="15" value="2">`

Above 10 dimensions, nothing changes since there are only 11 points (compare this to the facts that any 2 points in N-dimensions lie on a line (1d object), any 3 points in N-dimensions lie in a plane (2d object), etc.). Sadly, even at 10 dimensions, the symmetric difference distance still isn't well-approximated. This presumably means that metric spaces of sets with that distance are somehow not very embeddable in ordinary Euclidean space, but I don't know how to make that precise...

add_mds = {
  drinks.forEach((d, i) => {
    d.pos_jacc = jacc_mds[i];
    d.pos_symm = symm_mds[i];
  });
  return md`:point_left: This cell just adds the MDS positions to the \`drinks\` object.`;
}

Anyways, I only know how to make 2D plots, so we're going to take the first 2 coordinates of the MDS embeddings and draw them below using scatterplot code taken from D3 Scatterplot. (Wanna take a shot at making 3D plots?) I've drawn in the above "edit" links for comparison, too. First, the symmetric difference distance (the two points squished off on the right are the White Lady and the Aviation -- more on this later!):

(add_mds, chart(drinks, "pos_symm"))

Then, the Jaccard distance:

(add_mds, chart(drinks, "pos_jacc"))

Now, why are the MDS embeddings putting the White Lady (gin, triple sec, lemon juice) very nearly on top of the Aviation (gin, maraschino, lemon juice)? Looking at the corresponding rows in the distance matrices gives part of an explanation: these two cocktails have precisely the same distances to all the other cocktails! (Try clicking on the inspected "Object"s below.) Intuitively, then, since this picture is just a 2D projection of some higher-dimensional embedding, the "dimension" in which these two cocktails differ is "out of the plane" and we just see the fact that their distances to all the other points are identical.

// White Lady is #7, Aviation is #10
({whitelady:jacc_mat[7], aviation:jacc_mat[10]})

// White Lady is #7, Aviation is #10
({whitelady:symm_mat[7], aviation:symm_mat[10]})

There are several other interesting features of these embeddings that would be interesting to look into (particularly with a larger data set to see if these persist): Both place the Manhattan and Bourbon Manhattan close together, as well as the Negroni and Martini, which makes sense from the point of view of the edit graph, I think. They both place the Vodka Martini and the Americano closer than one might expect. Whiskey sour is in fact quite far from its neighbor in the edit graph, the Manhattan.

function chart(data, tag) {
  const height = 500;
  const svg = d3.select(DOM.svg(width, height));

  const margin = { top: 20, right: 40, bottom: 30, left: 40 };

  const x = d3
    .scaleLinear()
    .domain(d3.extent(data, d => d[tag][0]))
    .nice()
    .range([margin.left, width - margin.right]);
  const xAxis = g =>
    g
      .attr("transform", `translate(0,${height - margin.bottom})`)
      .call(d3.axisBottom(x))
      .call(g => g.select(".domain").remove())
      .call(g =>
        g
          .append("text")
          .attr("x", width - margin.right)
          .attr("y", -4)
          .attr("fill", "#000")
          .attr("font-weight", "bold")
          .attr("text-anchor", "end")
          .text(data.x)
      );

  const y = d3
    .scaleLinear()
    .domain(d3.extent(data, d => d[tag][1]))
    .nice()
    .range([height - margin.bottom, margin.top]);

  const yAxis = g =>
    g
      .attr("transform", `translate(${margin.left},0)`)
      .call(d3.axisLeft(y))
      .call(g => g.select(".domain").remove())
      .call(g =>
        g
          .select(".tick:last-of-type text")
          .clone()
          .attr("x", 4)
          .attr("text-anchor", "start")
          .attr("font-weight", "bold")
          .text(data.y)
      );

  svg.append("g").call(xAxis);

  svg.append("g").call(yAxis);

  svg
    .append("g")
    .attr("stroke", "steelblue")
    .attr("stroke-width", 1.5)
    .selectAll("path")
    .data(firstLinks)
    .join("path")
    .attr(
      "d",
      d =>
        `M ${x(data[d.data.source][tag][0])} ${y(
          data[d.data.source][tag][1]
        )} L ${x(data[d.data.target][tag][0])} ${y(
          data[d.data.target][tag][1]
        )}`
    );

  svg
    .append("g")
    .attr("fill", "steelblue")
    .selectAll("circle")
    .data(data)
    .join("circle")
    .attr("cx", d => x(d[tag][0]))
    .attr("cy", d => y(d[tag][1]))
    .attr("r", 4);

  svg
    .append("g")
    .attr("font-family", "sans-serif")
    .attr("font-size", 12)
    .selectAll("text.halo")
    .data(data)
    .join("text")
    .attr("class", "halo")
    .attr("x", d => x(d[tag][0]) + 4)
    .attr("y", d => y(d[tag][1]) - 4)
    .attr("stroke-width", 2)
    .attr("stroke", "#fff")
    .attr("fill", "#fff")
    .text(d => d.name);

  svg
    .append("g")
    .attr("font-family", "sans-serif")
    .attr("font-size", 12)
    .selectAll("text.content")
    .data(data)
    .join("text")
    .attr("class", "content")
    .attr("x", d => x(d[tag][0]) + 4)
    .attr("y", d => y(d[tag][1]) - 4)
    .text(d => d.name);

  return svg.node();
}

firstLinks

Dependencies: using cytoscape.js to lay out that graph and using Mike’s form-input notebook to power those inputs.

linkId = (a, b) =>
  [a, b]
    .map(d => d.id)
    .sort()
    .join("-")

cy = require("[email protected]/dist/cytoscape.umd.js")

formatDifference = steps => {
  let parts = steps.map(step => {
    switch (step.type) {
      case "replacement":
        return `replace ${step.removed} with ${step.added}`;
      case "addition":
        return `add ${step.added}`;
      case "removal":
        return `remove ${step.removed}`;
    }
  });
  if (!parts.length) return "… they’re the same.";
  if (parts.length == 1) return parts;
  if (parts.length == 2) return parts.join(" and ");
  return (
    parts.slice(0, parts.length - 1).join(", ") +
    `, and ${parts[parts.length - 1]}`
  );
}

formatDifferenceShort = steps => {
  let parts = steps.map(step => {
    switch (step.type) {
      case "replacement":
        return `${step.removed}/${step.added}`;
      case "addition":
        return `+${step.added}`;
      case "removal":
        return `-${step.removed}`;
    }
  });
  return parts.join(", ");
}

import {form} from "@mbostock/form-input"

numeric = require("https://bundle.run/[email protected]")

d3 = require('d3@5')

TODO It’d be nice for the difference algorithm to know about ingredient types, so that it suggests, say, replacing vermouth with sugar (because they’re both there for making things way too sickly sweet), rather than replacing vermouth with vodka. More data, of course! Making that cytoscape graph look better. It doesn’t look very good. :white_check_mark: Specify type of vermouth (sweet/dry) Use https://twitter.com/joncamfield/status/1098389771817226240 for lots more data

Recommend

Key Value storage based on Ranged LSM tree

Taming Dependabot

Plots within plots with ggplot2 and ggmap

Designing an Atomic Block File Format

Of Microservices & Containers

More OCaml

大部分欧洲政府支持受争议的版权法

Redis 再次改变它的开源许可证

监控技术在中国创造了至少四个亿万富翁

古人即使写偷情，也是雅的。比如王次回，有《夜逢》诗，“花气依微露气浓，窄衫轻鬓夜...

About Joyk