10

Why Grubhub uses crypto when generating coupon codes at scale

 4 years ago
source link: https://bytes.grubhub.com/why-we-use-crypto-when-generating-coupon-codes-at-scale-44dc737a52c9
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Why Grubhub uses crypto when generating coupon codes at scale

Image for post
Image for post
Photo by Mr Cup / Fabien Barral on Unsplash

With Hung Huynh.

There’s usually no need to get fancy when implementing discount codes for eCommerce cart checkout. You just need a CRUD app where your marketing team can create a random code and specify how much it’s for, how many times it can be used overall and per customer, and other restrictions such as the expiration date. You persist it in a database so you can track when it’s used and enforce the restrictions. You can tie multiple codes to a single marketing campaign by storing metadata and/or giving them a shared pattern. This is the right approach in most cases, and we still use it for small campaigns.

But what happens when your customer base (current or prospective) is in the tens of millions and you want to send each of them a unique code? Creating N coupon codes for your campaign implies writing N rows to a database in real time. It took a while for this to be an issue for us, since we use Cassandra, a write-optimized store. But while Cassandra famously can handle millions of writes per second, “can” and “should” are two different things. Leasing the amount of hardware necessary to get to that level of performance would, well, eat into our discount budget. Sharing the infrastructure we use for our production operation worked, but it meant we risked degrading performance of our customer-facing services — saving random strings was in contention with saving orders and took up a lot of disk space.

Our immediate, but annoying, solution was async and throttling. Instead of clicking a button to generate and download your codes, you clicked a button to enqueue a job to create your codes at a slow trickle. Check back in a few hours, and your codes were done. This removed the performance risk and minimized expense at the cost of crimping our marketers’ style. So while we had a working solution, we started on a more ambitious approach to get rid of that hours-long wait and save some more money.

What if we didn’t write a code to the database until it was used by the customer? While we’d love for 100% of our customers to use our discounts, in practice, far fewer will. Cassandra would only end up having to store a fraction of the rows, and the writes would naturally be distributed over the lifetime of the campaign instead of pouring in all at once at the start. All we’d need to store up front was the campaign-level data. We’d map a code like “GHCA88RPLY47” to the “RPLY47” campaign, which would tell us what discounts and restrictions to apply. Storing the code at usage time meant we could still prevent illicit reuse and be able to monitor the success of our campaign.

Image for post
Image for post

The only problem here is enemy action. While we can generate ten million random strings ending with RPLY47 easily enough, if we don’t store them, we’re left with no way to validate whether a coupon code with that suffix is one we actually sent out or one somebody made up after noticing that we sent out millions of codes with a common pattern. What we needed was a common pattern such that we could verify whether a given string matched, but an adversary couldn’t reverse engineer, even given an arbitrary number of examples of the pattern. We needed crypto.

Two bad ideas…

Specifically, we needed:

A. A compactly-specifiable space of N valid short alphanumeric codes such that

B. Someone with access to arbitrarily many examples of valid codes could not tractably generate more

An example of a scheme that fit the first criterion but failed the second would be:

  1. Define a random range of integers.
  2. Define a fixed suffix string.
  3. Convert the integers into our promo code alphabet (a variant of Crockford’s Base 32 encoding) and append the suffix.

For example, if our random range were the million numbers in [919043528204,919044528204), and our suffix string were “RPLY47”, we’d send out codes from TQXM6NGCRPLY47, TQXM6NGDRPLY47, TQXM6NGERPLY47, up to TQXN562CRPLY47, converting each integer in the range into our alphanumeric encoding. To check if a code was valid, we’d take the part before the suffix, convert it back into an integer, and check if it fell in the range stored with the suffix. Easy for us, but also far too easy to mostly reverse engineer — an attacker with just two codes can easily guess the pattern. The only aspect not easy to deduce is the exact boundaries of the range, but the attacker just needs values in the middle.

An example of a scheme that fits the second criterion but fails the first would be to define a random range of integers and a PGP key-pair, so that each coupon code would be an encrypted copy of an integer in the range. This doesn’t work because the output is too long — a typical PGP signature looks like “8oqERAqA2AJ91Tx4RziVzY4eR4Ms4MFsKAMqOoQCgg7y6
e5AJIRuLUIUikjNWQIW63QE”. While most of our customers are just copy-pasting anyway, in principle, our codes do need to be something you can shout across the room at a party.

…Equal one good one

There’s no fundamental reason, though, why an encrypted code has to be longer than a random one. What we needed was a reversible, cryptographically-secure pseudorandom permutation of numbers in a large range. In other words, an operation that met three criteria. It couldn’t make a number larger on average. It needed to be reversible if you know a secret. And, critically, if you didn’t know the secret, it couldn’t be reversed even if you had arbitrary examples of the permutation.

Luckily, this is 2019 and crypto is a mature field, so this was more a literature search problem than a need for original research. It turns out a stronger property has already been proven of the iterated Feistel block cipher, which is a relatively simple (and computationally cheap) algorithm. To encrypt a number, we split it bitwise into two pieces, hash one half with a secret key, xor the two, concatenate the other half, then repeat a few more times with different keys. Even though we’re using a one-way hash to incorporate the keys, we can still reverse the encryption just by reversing the key order because we’ve only encrypted half of the text in each step.

So our actual scheme goes, more or less:

  1. Choose or generate a unique suffix string for the campaign.
  2. Generate a random contiguous range of integers in (0..10⁶⁰) and a random compound secret key. Save them under the suffix string.
  3. Generate codes by encrypting each number in the range iterating over the key elements, then converting the result into an alphanumeric string with the suffix.
  4. Validate codes by converting them back into integers, decrypting them, and then verifying that they fall within the random range and have not previously been used.

Proof of security

To prove this resists practical attacks, we only¹ need the venerable finding of Luby and Rackoff that a Feistel block cipher (iterated four or more times) provides a permutation that appears random even to an attacker who can see the outputs of a polynomial number of chosen inputs. This is more information than an attacker is likely to have — it implies that the attacker has somehow inferred the random range andis able to see not just many examples of codes, but the order in which those codes were generated. But assume for simplicity that the attacker has this information. If they were then able to generate a valid code outside their sample set, that would give them probabilistic knowledge of how any given number in the range would be encrypted. Specifically, the odds of such a number encrypting to the code would be 1/(size of the range), whereas in a truly random permutation, the odds would be 1/(max integer). Thus the permutation would no longer appear random, contradicting Luby-Rackoff.

Coupon codes today

With our current implementation, defining a block of ten million codes is more or less instantaneous, and it takes about ten minutes to “page through” the range, generating the code for each integer in memory and uploading it to a distribution system. This empowers quick turnaround on any marketing idea that involves giving away massive amounts of free food. And really, isn’t that the best kind of idea?

[1] Stronger findings, e.g. https://www.iacr.org/archive/crypto2003/27290510/27290510.pdf exist for more iterations.


Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK