
BigQuery and surrogate keys: a practical approach


2018-10-02 · Google Cloud

Source: BigQuery and surrogate keys: a practical approach from Google Cloud

When working with tables in data warehouse environments, it is fairly common to come across a situation in which you need to generate surrogate keys. A surrogate key is a system-generated identifier that uniquely identifies a record within a table.

Why do we need surrogate keys? Quite simply: unlike natural keys, they persist over time (i.e. they are not tied to any business meaning) and they allow for unlimited values. Think about a table collecting IoT data from multiple devices in different regions: since you want to store unlimited data and device identifiers could overlap, surrogate keys help a lot in uniquely identifying each record. Another use case could be an “items” table in which you want to store not only information about articles but also the history of changes made to the data: here too, surrogate keys are an elegant way to assign unique identifiers to all records.

IoT sensor data feed

How to generate surrogate keys

A common way to generate surrogate keys is to assign an incremental number to each row of a table. You can accomplish this with the standard SQL function ROW_NUMBER() OVER({window_name | (window_specification)}).

BigQuery surrogate key generation example
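
As a sketch of this approach (assuming the same public Wikipedia benchmark table the article uses elsewhere), the query looks like this:

    SELECT
      ROW_NUMBER() OVER() AS SurrogateKey,  -- incremental id assigned over the whole result set
      *
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`

On a table of this size, the global sort behind ROW_NUMBER() is likely to fail with a resources-exceeded error, which is exactly the limitation discussed next.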

Unfortunately, this approach is limited. To implement ROW_NUMBER(), BigQuery needs to sort values at the root node of the execution tree, which is limited by the amount of memory in one execution node.

The general solution is to partition the data with a PARTITION BY clause and combine the partition fields with the per-partition row number to get a unique id per row.

Surrogate keys with partitioning
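
As an illustrative sketch (the exact query may differ from the original screenshot), the partition field can be concatenated with the per-partition row number so that the resulting key is unique across the whole table:

    SELECT
      CONCAT(
        CAST(year AS STRING), '-',
        CAST(ROW_NUMBER() OVER(PARTITION BY year) AS STRING)
      ) AS SurrogateKey,  -- partition value plus row number within that partition, e.g. '2010-42'
      *
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`

Because each partition is numbered independently, the sort runs per partition instead of on the whole table, which avoids the single-node memory limit.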

An alternative approach

Another solution that you might adopt, especially in streaming scenarios, relies on hashing functions. Since this pattern computes the surrogate key values at run time, it can be applied to data sets of any size. The idea is to compute the SHA256 digest of selected record fields and use the result as the surrogate key.

Two approaches are possible:

  • An INSERT INTO statement when new records have to be inserted into the table (adapt the inner query to select the data to insert into the target table):
    INSERT INTO `MyPrj.MyDataset.Wiki100B_With_SK`
    (SurrogateKey,year,month,day,wikimedia_project,language,title,views)
    SELECT (SHA256(title)) AS SurrogateKey,*
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`

  • An UPDATE statement when existing records have to be updated (adapt the filters to select the correct data to update):
    UPDATE `MyPrj.MyDataset.Wiki100B_With_SK`
    SET SurrogateKey = (SHA256(title))
    WHERE year = 2010

When you plan your activities, please note that you are limited to 1,000 INSERT operations per table per day and 200 UPDATE operations per table per day.

If there are duplicate records in the source table, a random value can be concatenated to the fields before computing the digest: this reduces the probability of collisions.
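
For example (an illustrative variant of the SELECT used in the INSERT above), a random value could be appended to the hashed field:

    SELECT
      SHA256(CONCAT(title, CAST(RAND() AS STRING))) AS SurrogateKey,  -- RAND() differentiates rows that share the same title
      *
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`

Note that this makes the key non-deterministic: hashing the same row twice yields different keys, which is acceptable for a surrogate key but rules out recomputing it later.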

Surrogate keys with the GENERATE_UUID function

Another way to achieve the same result is to use a new function called GENERATE_UUID(), which generates a unique identifier for each row of the query result at run time. The generated key is a lowercase string consisting of 32 hexadecimal digits in five groups separated by hyphens, in the form 8-4-4-4-12.

UUID generation in BigQuery on Wikipedia example
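
A minimal sketch on the same benchmark table:

    SELECT
      GENERATE_UUID() AS SurrogateKey,  -- random version-4 UUID, unique per output row
      *
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`

Since the value is computed independently for each row, GENERATE_UUID() needs no sort and no partitioning, so it scales to tables of any size.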

Conclusion

Surrogate keys are common in data warehouse environments because they are:

  • Context-free

  • More future-compatible

  • Potentially infinitely scalable

BigQuery provides end users with the ability to easily deal with surrogate keys, enabling their generation and update at scale.

We hope you enjoyed learning some possible patterns to tackle surrogate key management; the engineering team is continuously improving the service by adding new capabilities or extending the ones already available. If you’d like to stay up to date with future additions, consider bookmarking the release notes page.

