Database and Theory

December 17, 2020

A dean of engineering

Jennifer Widom was appointed as the Dean of Engineering at Stanford not long ago. No doubt her research in databases and her teaching played a major role in her selection. For example, she is known for her contributions to online education including teaching one of the first massive open courses—MOOCS—which had over 100,000 students.

Today I thought we might highlight her area, databases, and one of its connections to theory.

But first, I was impressed to hear her say in an ACM podcast that she used punch cards during the start of her career:

So my passion in high school actually became music, and when it came time to select a college, I chose to go to music school. I’m pretty sure at this point, I’m the only dean of engineering anywhere who has a bachelor’s degree in trumpet performance, but that’s actually what my undergraduate degree is in. Late in my music education, I just sort of randomly took a class called computer applications and music research. And it was a class in the music school about using programming to analyze music, and it was my first exposure to computer programming. I have to say, it’ll reveal my age, but I used punch cards in that class. It was sort of the end of the punch card era.

I admit to having used these too at the start of my days as an undergrad at Case Institute. Do you know what a punch card is?

Ken once had his own experience with punch cards and music. While taking an intro programming course during his sophomore year—taught then by Jeffrey Ullman—Ken and his roommate joined an excursion to NYC for dinner and the New York Philharmonic. Still dressed in suits, on return to Princeton, they descended into the computer center to do their assignment—via punch cards on a batch server. Ken’s roommate got out before 2am, but Ken says he was still fixing punch card typos that ruined runs as late as 4am.

Data Management

Widom’s research has always been in the area of non-traditional databases: semi-structured data, data streams, uncertain data and data provenance.

We theorists view the world as made of questions like:

Upper Bounds Can we find a faster algorithm for problem X?
Lower Bounds Can we prove there are no faster algorithms for X? Sometimes we replace “faster” by other complexity measures, but these are the dominant questions we ask.
Widom proposed a notion that is called lineage. Theorists worry about the performance of searching, but this issue is about where did the data come from? See also provenance.

The insight boils down to garbage in and garbage out, or GIGO:

On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”

Charles Babbage (1864).

The issue of lineage and provenance is exactly this made modern. How can one track where the data comes from? Of course, finding fast algorithms to do the tracking is still an important question. Oh well.

Databases Meet Theory

Of course databases store and allow us to retrieve information—they are everywhere these days. One reason for their dominance is they support complex queries. For example, we can use databases to obtain all ${x}$ so that ${R(x,y)}$ is true for some ${y}$ :

$\displaystyle \{ x \mid \exists y R(x,y) \}.$

From a theory point of view any first-order question is allowed.

But searches often use approximate matches. Thus one might wish all ${R(x',y)}$ so that ${x'}$ is near ${x}$ . We might want all ${R(x',y)}$ so that ${x'}$ is not “Michael” but also includes “Micheal”. This type of error is not too hard to handle, but a worse type of error is

$\displaystyle Michael \iff Michal$

Formally changing a letter is not as difficult as deleting or adding letters. This is the edit distance between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. An important application is to biology.

This leads to into the more-general notion of entity resolution (ER). Widom co-leads the Stanford Entity Resolution Framework (SERF) with Hector Garcia-Molina; they were two of the authors on its earliest paper in 2006. The page highlights the lexical nature of ER:

For instance, two records on the same person may provide different name spellings, and addresses may differ. The goal of ER is to “resolve” entities, by identifying the records that represent the same entity and reconciling them to obtain one record per entity.

Their approach “involves functions that `match’ records (i.e., decide whether they represent the same entity) and `merge’ them.” Their framework has an outer layer that treats the functions as generic. Widom’s emphasis has been on axiomatizing properties of the functions to enable efficient and extensible manipulation by the outer layer.

Necessarily under the hood is a measure of edit distance. Ken recently monitored a large tournament where two players had the same surname and first three prenames, differing only in the last four letters of their last prename. The widget used to download the game file originally given to Ken truncated the names so that there seemed to be one person playing twice as many games. They could also have been distinguished by other database information. More common is when the same player is referenced with different spellings, such as (grandmaster Artur) Jussupow or Yusupov. The similarity metric needs to be extended beyond the name. This is when computation time becomes more a factor and theory enters in.

Edit Distance

Computing the edit distance between two strings is a long-studied problem. The best algorithm known is order quadratic time, as we covered here. There are better running times if we allow approximate algorithms or quantum algorithms. See this for the latter and this post for work related to the former.

Theorists have tried to improve the time needed for classical exact edit distance. It remains at quadratic time, however, and there is evidence of inability to do better that we have remarked as puzzling. One of the strange traits of theorists is that when something is hard we try to exploit the hardness.

When life gives you lemons, make lemonade.

For instance, factoring integers is hard—so let us make crypto systems. The lemonade in the case of edit distance is a connection to the hardness of really-hard problems. To quote the seminal paper by Arturs Backurs and Piotr Indyk:

In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time ${O(n^{2-\epsilon})}$ for some constant ${\epsilon > 0}$ , then the satisfiability of conjunctive normal form formulas with ${N}$ variables and ${M}$ clauses can be solved in time subexp. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.

Edit Distance—Searching

An open problem that is closer to databases is not computing edit distance. Instead, the problem how to search for strings that are near a given string. This can be a generic use of an edit-distance function. A larger question is whether this can be approached more directly.

I believe that while edit distance may indeed be quadratic time, it still is open how fast searches can be done for approximate matches. The problem is given a large collection of strings ${S}$ how fast can we find the closest string in the collection to some word ${w}$ ? That the best algorithms for edit distance are probably quadratic does not—I believe—entail that this search question is difficult.

Open Problems

By the way, see this for other open problems in databases.

[deleted stray line]