Things that count

Differential privacy is a pretty solid way to count things, and something that I kinda like talking about, so I thought I would walk through some of the ways that you can count things using differential privacy. This should connect up part of the previous post (which touched on counting queries in TPC-H) and a recent request to offer up some thoughts about getting distributional information out of users interactions with lobste.rs .

The specific sort of problem we are going to look at is:

Given a collection of (user, page) records, produce the distributions of multiplicities for each user and each page .

The idea is that some users may show up in the dataset more than others, and that some pages are likely much more popular than others (or at least, show up more often). We would like to dig out that information while providing some (differential) privacy guarantee for the data.

As part of doing this, we are also going to learn how to count the numbers of distinct users and pages, and pull out the maximum multiplicities for each user and page, as part of ultimately just producing the distribution of multiplicities (how many things occur each number of times?) for users and pages.

While you might have counted things with differential privacy before, I hope that parts of this post are new to you. The ideas have been previously published (first in Boosting the Accuracy of Differentially-Private Histograms Through Consistency , and then improved (in my mind) in the wPINQ paper ), so it is possible that you know what we'll talk through. If nothing else, we will explain the 334x improvement of wPINQ over a recent VLDB paper when it comes to the counting in TPC-H query 13, so I suspect this should be new for at least some people.

Differential privacy, and counting!

Differential privacy is a property of randomized algorithms that act on datasets, and which roughly guarantees that slight changes in the input dataset manifest as only slight relative changes in the probability of seeing any given output. This has appealing privacy interpretations, mainly that if you made any slight contribution to the input dataset it will not result in a substantial change in the output (or the output's ability to inform others about your data).

One especially appealing aspect of differential privacy is that its guarantees hold in the presence of arbitrary auxiliary information, meaning that your slight contribution is masked even to people who may have a lot of additional information about you and perhaps the rest of the dataset. This is a real concern in the target application,

77bQB3J.png!web

because our intended data analysis is against information that is probably at least partially (though non-obviously) public.

The prototypical differential privacy mechanism is the Laplace mechanism , which computes a sum across the records in the input dataset and adds noise drawn from the Laplace distribution . If the scale of the distribution is calibrated to the maximum value any one record could possibly contribute to the sum, you get a differential privacy guarantee out the other end.

One easy way to make sure that each record can contribute only a bounded amount is to perform counting queries , where each record contributes either 0 or 1, depending on whether the record satisfies a predicate or not (usually the other way around, though). No matter how weird and sensitive your record, it could only possibly change a count by at most one, and so we only need to add Laplace noise of a bounded scale in order to give everyone some amount of differential privacy.

Composition: bad news!

While counting things is certainly fun (right?), we have to show some restraint. Each time we perform a counting query we may reveal a bit more about the input dataset, and generally the quantitative differential privacy guarantees degrade additively . This means that if we issue five counting queries, our privacy guarantees probably just got worse by a factor of five.

For better or for worse, this is probably the "right answer" when we use differential privacy, and for the present discussion we will try to think about how to be more clever, and get more out of our dataset with fewer queries.

Histogramming queries

We can generalize counting queries just a bit, and introduce a substantial amount of functionality.

Let's say that instead of performing a sum over all records, where each record contributes some variable amount, instead we compute many simultaneous sums where each record can choose to contribute to at most one sum. Just as with the counts, we release each of the sums after adding independent scaled Laplace noise.

For example, rather than perform a counting query as a sum of values 0 and 1, we can have each record contribute one to either the true count or the false count. The result for true is exactly the sum we would have received before (number of satisfying records, plus noise), but now we also get a count of the number of falsifying records, which gives us a better sense for the total number of records (and the relative fraction that satisfy the predicate).

These "simultaneous sums" queries are commonly called "histogram queries", because it is like we are filling out the counts for a histogram of the data, apparently. Their important feature is that because each record can contribute to at most one count, even though we don't know which one a priori , we still only need to add the same amount of Laplace noise as if doing a single summation.

Histogram queries are pretty powerful, to the extent that they are essentially the only differential privacy measurement that wPINQ provides. That's pretty handy, because it means that rather than understand all the weird computations we might do, we can just understand one of them really well.

Caveats: There are some important caveats for histogram queries. Mainly, although the histogram query may produce many, many different sums, it should not reveal which of these sums are in fact zero.

It is very easy to do this accidentally. For example, if we scan through a collection of records and count the number of people with each age (in years), and then noise and release the sums for those ages we saw, we are disclosing the set of ages seen in the dataset. If we didn't happen to see anyone with age 37, we shouldn't disclose this by failing to report a count. Many people have gotten this wrong.

A histogram query should either compute and returned noised sums for all possible histogram bins, determined before performing the query, or it should produce no sums and instead support interactive access where a user specifies the histogram bins of interest. For example, wPINQ produces no counts, and instead insists that a user explicitly ask for counts of interest.

Histogramming in action

Let's try out some histogram queries!

Imagine we have a dataset of pairs (user, page) with the implication that user viewed page . Perhaps both user and page are strings, maybe user names and some sort of URL?

We could perform a histogram measurement that asks "for each page, how many users viewed the page?" just by having each (user, page) contribute to the count for page . That's totally fine, but remember that we can't reveal which pages actually exist. If someone has a page in mind (perhaps they also viewed it, or perhaps it is publicly listed somewhere) they can ask for the noised count for the page, but the measurement itself cannot directly disclose which pages have any views.

This may be a bit abstract, so let's look at some code instead. This is from the Rust port of wPINQ on timely dataflow , whose existence is apparently now a public thing.

let mut page_views =
worker.dataflow(|scope| {
    data.enter(scope)
        .map(|(user, page)| page)
        .measure(&mut probe, &total)
});

There are probably many things here that don't make a lot of sense, even if you are familiar with timely dataflow. For the moment let's imagine that data is our protected collection of (user, page) views, and that probe and total are random diagnostic things that we needn't worry about yet.

What the code up above does is frame a new query (in the context of a "dataflow") that takes a protected collection data and using the map method extracts just page from each (user, page) pair. We are essentially discarding the user field, which suits us just fine if we only want the counts for each page. We then measure the result, which is wPINQ's way of saying "histogram query", and return the measurement as page_views .

At this point we could ask about the counts for various pages. For example, we might write:

println!("{:?}", page_views.observe("Page 1"));
println!("{:?}", page_views.observe("Page 2"));
println!("{:?}", page_views.observe("Page 7"));

and we would get as output three noise counts about whatever the sums were for these pages. The pages might not exist (they do have pretty silly names), in which case the measurement would be only noise. The measurement structure is pretty smart, and if you ask for the same count a second time it will return the original measurement (so you can stop writing that very clever email now).

A privacy digression

You might have noticed that we can do exactly the same thing users as we have done for pages: ask how many distinct pages each user has viewed. That sounds like it could be a privacy violation, so we should talk that out.

Differential privacy masks the presence or absence of records in the input collection, and doesn't provide any direct guarantees for entities that may be reflected in multiple records. In our running example, differential privacy masks the presence of any (user, page) pair, but it is not obliged to mask the presence or absence of all (user, _) pairs.

This subtlety of differential privacy is important to understand, and there is some work to do to ensure that the guarantees it provides corresponds to the guarantees you (as data curator) would like to provide.

If you wanted to protect the full history of a user, for example, you might want to re-frame your dataset as pairs (user, pages) , where the second element of the tuple is a list of page identifiers. Differential privacy would now mask the presence or absence of entire user histories, but at the same time it is now harder to get meaningful counts as each user can contribute to at most one of the page counts, despite having perhaps many pages in their history.

There is often a trade-off between privacy and utility, and this example demonstrates one way that it can show up. Our goal as privacy researchers is to invent mechanisms that make the trade-off less painful.

For the rest of this post, we will use (user, page) records, masking only individual user views rather than entire user histories.

Improving histogram queries

By themselves, histogram queries have somewhat limited utility. Up above we could probably perform histogram queries for page and user . With a bit more creativity we could probably also put together some more queries for functions of page and user , but it isn't clear that this is going to get us to the most interesting of queries.

The approach that PINQ and wPINQ propose is to support transformations of datasets before histogramming. Under some technical conditions (see: "transformation stability") we can guarantee that histogram measurements made of transformed data still provide differential privacy for the pre-transformed data.

We have already seen one transformation: map . It turns out that if you apply a stateless, functional, side-effect free operation to each record independently, it is fine to histogram up the results. We used that up above without much thought, but it actually has mathematics supporting it. There are a few other relatively simple transformations, like the filter transformation that discards some records, but the most exciting ones are more interesting than they are simple.

The `shave` transformation

The transformation we are going to be the most interested in is called shave , and it is a bit weird.

For each value in your dataset, let #(value) be the number of records assuming that value in your dataset. The shave operator produces as output the records

(value, 1)
(value, 2)
..
(value, #(value))

In effect, the shave operator takes our #(value) identical records and distinguishes them with an increasing integer identifier. The records were identical before, so there isn't a worrisome question of which input record gets which identifier; we've just changed the dataset to have a richer set of different values.

The shave operator is like map , in that we can just apply it and histogram the result; a single change it its input results in a single change in its output.

So what is so useful about adding an increasing integer, you might ask?

Take a moment and think through the following fairly simple modification to our histogram query from above (two new lines, just before measure ):

let mut who_am_i =
worker.dataflow(|scope| {
    views
        .enter(scope)
        .map(|(user, page)| page)
        .shave()
        .map(|(page, index)| index)
        .measure(&mut probe, &total)
});

This computaton produces who_am_i , which measures a collection of integers, right? We shave our collection of page views, where the count for each page was the number of times it is viewed, and so each page contributes index values from one up through its count.

The measurement for each index is the number of pages with at least index views.

If we map out the measurements for each index from one up to .. thrice, or whatever we want, we map out the cumulative density function for the numbers of views of pages:

let thrice = 1000;
for index in 0 .. thrice {
    println!("{}:\t{}", index, who_am_i.observe(index));
}

I've taken a small graph I happen to have, mico.txt , and I'm treating its edges as (user, page) pairs (even though both are node identifiers). Performing this same measurement on this graph data reveals the distribution of numbers of incoming edges, also called the "in-degree distribution".

Let's take a look (using epsilon = 0.1 differential privacy):

bMRbQvA.png!web

This picture is a bit of a mess. What I've plotted is 1,000 noise measurements corresponding to each of those multiplicities we measured up above, as well as the "truth" as a solid line behind the measurements. It is a bit hard to see the solid line other than in the tail where the noise is pretty crazy.

Although you can't really see it well in the figure, we've really nailed the high-count measurements. Let's walk through multiplicities one through ten, and compare the noisy measurement and the truth:

multiplicity noisy truth error 1 91361.78 91364 -2.22 2 81131.25 81131 0.25 3 74550.49 74551 -0.51 4 70420.30 70422 -1.70 5 67574.14 67568 6.14 6 64799.66 64783 16.66 7 61348.12 61346 2.12 8 57129.93 57113 16.93 9 51888.44 51882 6.44

It looks like we have some 90,000 "pages" (plus or minus) with at least one view. That is a pretty good way to measure the number of distinct pages, right? But, we don't have to stop there; we keep measuring the numbers of nodes with each multiplicity, out to 1,000.

From the figure, it kinda looks like things fizzle out before 1,000, which still leaves us with some questions about, for example, what the maximum in-degree is (plus or minus). Also, all the measurements out in the tail look pretty horrible, which is partly due to plotting small numbers perturbed by additive noise on a logarithmic scale (all the points at the bottom are the negative counts).

Can we perhaps find a way to improve the accuracy out in the tail?

One good shave deserves another

That's probably not actually a saying. In fact, one imagines that the singular property of a good shave is that you probably don't immediately require another.

But, let's just do that, because .. well I really can't think of a good way to motivate what we are about to do:

let mut wtf_am_i =
worker.dataflow(|scope| {
    views
        .enter(scope)
        .map(|(user, page)| page)
        .shave()
        .map(|(page, index)| index)
        .shave()
        .map(|(index, count)| count)
        .measure(&mut probe, &total)
});

I literally just copy and pasted those two shave and map lines (and then changed the variable names to avoid confusing you all).

So, new puzzle: what does this measure? This is a lot less obvious, so let's talk it through.

First, let's agree that we got to a point where we had a collection of index integers each with multiplicity equal to the number of pages with that multiplicity. This is just before the new shave and map. Now, shaving these again, the value one gets produced by each index with at least one element. That is, for each of the multiplicities with at least one representative article, we get a one element.

If you scratch your head for long enough you'll come around to this being the maximum frequency. It is the count of the article with the largest count.

Likewise, the number two occurs exactly as many times as the second-most-common article. The number three occurs exactly as many times as the third-most-common article, and so on.

We have mapped out the vector of frequencies, without having to know the names of the pages to ask about them directly. When we add noise to these measurement, we are getting noisy measurements of the non-increasing sequence of multiplicities.

Unfortunately this vector of frequencies has 90,000 elements in it, so I don't actually want to plot it at you. We'll get back to that in just a moment.

How about thrice?

Shaving a third time? That makes no sense. We aren't going to do silly things here, sorry.

An explanation

If this all seems a bit mysterious, let's take a visual walk through what the shave operator is doing. Let's think of records as building blocks, perhaps stamped with a symbol indicating their value. We can stack blocks with the same value, and their height will be their multiplicity in the collection.

zaIzeqR.png!web

The shave operator takes each stack of blocks and re-stamps each block with its height in the stack.

bEFFfuB.png!web

If we retain only the height of each block, and discard the original value, all blocks in a row get the same value. If we restack the blocks by value, we get a new stack of blocks like so:

3yyyqaF.png!web

Notice that I have sneakily swapped the colors here, so that the green blocks come second rather than the blue blocks. This was done so that it is painfully obvious that if we do this again, that is relabel each row by its height, then we are just recovering the original colors (or a permutation of them).

We could repeat this process if we want, relabeling and restacking, but there is an easier way to do it: we can just rotate our stack blocks, along the y = x diagonal line, so that the rows of common values become columns of common values.

Our measure histogram query reports (with noise) the heights of each of the columns (e.g. counts of "A", "B", "C", and "D"). The first shave takes a collection of columns (pages) and rotates them to tell us about the culumative frequencies (e.g. the counts of "1", "2", and "3"). The second shave takes these frequencies and rotates them to tell us about the (decreasing) sequence of frequencies (e.g counts of pink, green, blue, and orange). A third shave would just spin them back around to be the cumulative frequencies again, which we have already had the chance to measure.

That's why shaving three times is silly.

Integrating information

So we have maybe 1,000 counts from our first measurement, and some 90,000 counts from our second measurement (pop quiz: where did those upper limits come from? answer: each from the first measurement of the other).

It turns out through the wonder of mathematics, we can integrate the two measurements we have taken and produce a result that is accurate both for low multiplicities (those 90,000 folks with multiplicity one) and for high multiplicities (those few pages with multiplicities in the 600s).

It is really very cool, and there are details in the wPINQ paper (or its workshop predecessor), and I'll just leave you with a hint that the "truth" we are hoping to reconstruct is a path from (0, infinity) to (infinity, 0) that moves down and to the right along grid lines. Moreover, if we want to minimize the sum of the errors between path and observed measurements, it is just a shortest path problem.

Here are real (in purple) and synthesized (in black) cumulative frequencies for various page view multiplicities, taken with 0.2-differential privacy (each of the measurements was done with 0.1-differential privacy, and the privacy degradation accumulates addivitely).

fii2euq.png!web

It's hard to see exactly where things are best or worst, but we can compute the total error as the volume between the curves: 2,117 distributed across either 650 or 90,000-ish measurements, depending on how you look at it.

Cumulative commentary

Differential privacy, and especially the shave operator from wPINQ, are pretty great at producing cumulative frequency plots with privacy guarantees. The distributional information we get out gives a great head-start on synthesizing representative datasets (another thing wPINQ aims to do).

Things that count