Customer Matching using Google Cloud Dataflow

Summary

This article illustrates how we modernized a client’s data platform by implementing an entity resolution pipeline. We tied disparate data sets together and performed customer matching to create a Golden Customer Record using Google Cloud Dataflow. The solution enabled their Marketing/Analytics teams to derive valuable insights about their customers, make better-informed decisions for marketing campaigns, and explore new ways to improve customer experience/retention.

Preface

What is Entity Resolution?

Entity resolution (ER), otherwise referred to as record linkage or data matching, is the process of disambiguating — identifying, matching, and merging — different manifestations of the same real-world entity across disparate data sources.

For instance, a business that maintains customer information may store multiple data records referring to the same customer across several systems. These records can contain different full names (legal name vs. preferred or maiden name), email addresses (personal vs. business), or phone numbers (primary vs. secondary). Determining which of these records correspond to the same individual, and grouping them accordingly, is the challenge that entity resolution solves.
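To make the grouping problem concrete, here is a minimal sketch in plain Python (not Dataflow code). It assumes that two records belong to the same customer whenever they share an email address or a phone number, and it links them transitively with a small union-find structure. The record ids, field names, and values are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records; ids, field names and values are made up for illustration.
RECORDS = [
    {"id": "loyalty-1", "name": "Katherine Smith", "email": "kate@example.com", "phone": "555-0100"},
    {"id": "web-7", "name": "Kate Smith", "email": "kate@example.com", "phone": None},
    {"id": "pos-3", "name": "K. Smith", "email": None, "phone": "555-0100"},
    {"id": "web-9", "name": "Raj Patel", "email": "raj@example.com", "phone": "555-0199"},
]


def group_records(records, keys=("email", "phone")):
    """Group records that share any non-empty value for one of `keys`."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Walk up to the representative of x, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (field, value) -> id of the first record carrying it
    for r in records:
        for k in keys:
            v = r.get(k)
            if not v:
                continue  # missing values never link records
            if (k, v) in first_seen:
                union(r["id"], first_seen[(k, v)])
            else:
                first_seen[(k, v)] = r["id"]

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())


GROUPS = group_records(RECORDS)
print(GROUPS)
```

Here the three "Smith" records end up in one group even though no single field is shared by all of them: the loyalty record links to the web record through the email and to the in-store record through the phone number.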

Why is Entity Resolution useful?

Technology has revolutionized the way businesses operate. With continuous feeds of data, including online transactions, customer profiles, IoT devices, subscriptions, and many more, enterprises have the ability to capitalize on this abundance of information to gain insights and increase their profitability. These siloed sets of information are often stored in separate data stores, making it difficult to generate connections and obtain a holistic 360° view. Entity resolution is the means through which the disambiguation of this data is achieved.

Real-world scenario: Rick’s Coffee Shop

Let’s use Rick’s coffee shop as an example. Rick begins with a single brick-and-mortar store where he sells his blend of coffee. As the shop gains popularity and repeat customers, he introduces a loyalty rewards program that encourages customers to keep coming back by letting them collect points toward discounts and other promotions. Customers sign up for the loyalty program by providing their name, email, and phone number in-store. This information, along with subsequent transactions, is logged and stored within the loyalty database.

Soon after, the popularity of Rick’s special blend justifies a packaged offering that customers can purchase online and enjoy from the comfort of their homes. The resulting online store keeps e-commerce user profiles and payment information in the web platform database.

The growing importance of social media prompts the shop to strengthen its online presence across various platforms, which also gives it an easy way to connect with customers. User information and activity involving the coffee shop’s social media accounts are stored in each platform’s databases and are accessible through its APIs.

Digital Manifestations vs. Real-world Entity

In this common example, we can extract the following records that correspond to a single entity/customer:

Payment information from in-store purchases
Loyalty account information
E-commerce customer information
Social media following

Joining these records into a “single pane of glass” view can transform a business and support its longevity. For example, analysts can use insights into consumers' spending behaviour and patterns to segment customers, and those segments can then drive marketing campaigns. While this matching process may seem trivial to humans, it is far from trivial for machines.

What are the challenges?

Silos

Silos are one of the key challenges of entity resolution. Typically, systems are designed and developed over many years. With newer and ever-changing technologies, the possibility of having many disparate systems is very high. As a result, it is cumbersome to join all of the data across these sources to generate a unified view. However, this may be simplified by the existence of a data lake or warehouse, which acts as a centralized repository.

Definition of “a match”

Assuming you can aggregate the data across disparate systems into a pipeline, determining which fields to join on is non-trivial. If a globally unique entity identifier exists across all data sources, the matching process is as simple as joining on that identifier. However, it is improbable that this unique identifier exists in every single source. In the absence of entity identifiers to link the same entity across the disparate systems, it is up to the business to determine which attributes are suitable to match against. Every enterprise has its own definition of what is considered a “match”.
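As an illustration of what such a definition might look like in code, the sketch below encodes one possible rule in plain Python. The rule itself is an assumption made for this example, not the client's actual definition: trust a shared identifier when both records have one, otherwise fall back to an identical email, otherwise require an identical phone number together with an identical last name. The field names (customer_id, email, phone, last_name) are hypothetical.

```python
def is_match(a, b):
    """Return True if two customer records are treated as the same customer.

    Illustrative rule (an assumption, not the article's actual rule):
      1. If both records carry a customer_id, compare only that.
      2. Otherwise, an identical non-empty email is a match.
      3. Otherwise, an identical non-empty phone AND last name is a match.
    """
    if a.get("customer_id") and b.get("customer_id"):
        return a["customer_id"] == b["customer_id"]
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    same_phone = a.get("phone") and a.get("phone") == b.get("phone")
    same_last = a.get("last_name") and a.get("last_name") == b.get("last_name")
    return bool(same_phone and same_last)


# Hypothetical records from three different source systems.
loyalty = {"email": "kate@example.com", "phone": "5550100", "last_name": "smith"}
web = {"email": "kate@example.com", "last_name": "smith"}
pos = {"phone": "5550100", "last_name": "smith"}
other = {"phone": "5550100", "last_name": "jones"}

print(is_match(loyalty, web))    # matched on email
print(is_match(loyalty, pos))    # matched on phone + last name
print(is_match(loyalty, other))  # shared phone alone is not enough
```

Keeping the rule in a single function makes the business definition of a match explicit and easy to revise when the business changes its mind about which attributes count.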

Data Quality

The quality of data can make or break the integrity of matching pipelines. There is no guarantee that one data source attribute will appear exactly the same in another data source due to incomplete information, duplicated entries or a lack of standardization across source systems. This typically results from data entry errors, missing values, inconsistent formatting, a lack of data validation, or changing data. Inaccurate data can yield both incorrect matches (false positives) and missed matches (false negatives).
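A common mitigation is to standardize attributes before comparing them. The following is a minimal sketch using only the Python standard library. The specific choices are assumptions for illustration: emails are trimmed and lowercased, phone numbers are reduced to digits with a default country code of "1" prepended to 10-digit numbers, and names have accents stripped and whitespace collapsed. A real pipeline would choose rules that fit its own data.

```python
import re
import unicodedata


def normalize_email(value):
    """Trim surrounding whitespace and lowercase; empty values become None."""
    if not value:
        return None
    return value.strip().lower() or None


def normalize_phone(value, default_country_code="1"):
    """Keep digits only; assume 10-digit numbers lack a country code."""
    if not value:
        return None
    digits = re.sub(r"\D", "", value)
    if not digits:
        return None  # e.g. placeholder text such as "n/a"
    if len(digits) == 10:
        digits = default_country_code + digits
    return digits


def normalize_name(value):
    """Strip accents, collapse runs of whitespace, and casefold."""
    if not value:
        return None
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    collapsed = " ".join(without_accents.split())
    return collapsed.casefold() or None


print(normalize_email("  Kate@Example.COM "))
print(normalize_phone("(555) 010-0100"), normalize_phone("+1 555-010-0100"))
print(normalize_name("  José   GARCÍA "))
```

After normalization, the two differently formatted phone numbers above compare as equal, which removes one source of missed matches without loosening the matching rule itself.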

Scalability

The amount of data a business holds directly affects the matching pipeline's efficiency: comparing every record against every other record grows quadratically with the number of records. Entity resolution solutions must therefore utilize parallel processing frameworks, such as MapReduce, to conduct the matching process efficiently. It is also paramount to provision enough workers to handle the volume of data being processed.
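Besides adding workers, a widely used way to keep the workload manageable is blocking: records are first grouped by a cheap key, and only records within the same group are compared. Blocking is not described in this article; it is introduced here purely as an illustration. The sketch below is plain Python, and the blocking key (first letter of the last name) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of records: n * (n - 1) / 2 comparisons."""
    return list(combinations(records, 2))


def blocked_pairs(records, block_key):
    """Only pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        key = block_key(record)
        if key is not None:  # records without a key are not compared here
            blocks[key].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def last_name_initial(record):
    """Hypothetical blocking key: first letter of the last name."""
    name = record.get("last_name")
    return name[0].lower() if name else None


RECORDS = [
    {"id": 1, "last_name": "Smith"},
    {"id": 2, "last_name": "Smyth"},
    {"id": 3, "last_name": "Stone"},
    {"id": 4, "last_name": "Patel"},
    {"id": 5, "last_name": "Park"},
    {"id": 6, "last_name": "Jones"},
]

print(len(all_pairs(RECORDS)))                         # 15 comparisons
print(len(blocked_pairs(RECORDS, last_name_initial)))  # 4 comparisons
```

The trade-off is that two records of the same customer that fall into different blocks are never compared, so the blocking key has to be chosen with the data quality issues above in mind. In a parallel framework, each block can be handed to a different worker, since blocks are independent of one another.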
