Lyft Case Study

Lyft Speeds Up Data Discovery with Tool Using Neo4j

The Challenge

Data is at the heart of every decision at Lyft. Once decisions are made, their impact is evaluated using data.

Given the vital role of data and analytics across the company, the speed with which users can find data, understand it, analyze it and gain insights is critical.

Data discovery – finding the right data and understanding it – was slow and inefficient. Tables might have similar names, like driver_rides_completed and rides_driver_total.lifetime_completed. Users asked coworkers for help, reached out on Slack channels or looked at GitHub to see how a table was generated. They often pulled the first 100 rows to get a feel for the contents.

Lyft’s growth exacerbated the challenge of data discovery. According to Tamika Tannis, a Lyft software engineer, Lyft already had about 10 petabytes in thousands of tables across a variety of data stores. Growth meant even more data generated by the mobile app and other services. As new talent was hired, the number of users doing data discovery also grew.

Lyft needed a better way to support data discovery for everyone in the company. To quantify the problem and get a baseline, Tannis’s team looked at the impact on data scientists and found that data discovery consumed about a third of their time.

The Solution

Lyft engineers decided to build a tool to simplify data discovery. Their first target audience would be the most frequent users of data: analysts and data scientists.

Named Amundsen, the tool would offer three complementary ways to do data discovery: search-based, lineage-based and network-based.

Effective search was a top priority, with results ranked by popularity and relevance. Lineage-based discovery traces connections among datasets. Network-based discovery connects data with people, which is particularly valuable for new team members.

“You might want to see what data resources your manager or your coworkers are using so you can use trusted data resources that everyone else is already using for similar purposes,” said Tannis.

Amundsen uses a microservice architecture. The Databuilder service ingests data into two services: the search service, which is backed by Elasticsearch, and the metadata service, which is backed by the Neo4j graph database. Elasticsearch powers the search by providing relevance based on search terms, the user’s position in the company and the popularity of the tables. All of those connections are first made in Neo4j.
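To make that flow concrete, here is a minimal sketch in plain Python. It is not Amundsen's code: MetadataGraph and SearchIndex are hypothetical in-memory stand-ins for the Neo4j-backed metadata service and the Elasticsearch-backed search service, and the scoring rule (term matches first, then usage count) is an assumption chosen only to illustrate ranking by relevance and popularity.

```python
from dataclasses import dataclass, field


@dataclass
class TableRecord:
    """Hypothetical metadata record for one table."""
    name: str
    description: str
    used_by: list[str]  # users who query the table


@dataclass
class MetadataGraph:
    """In-memory stand-in for the graph-backed metadata service."""
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def add_usage(self, user: str, table: str) -> None:
        self.edges.add((user, "USES", table))

    def popularity(self, table: str) -> int:
        # Popularity here is simply the number of distinct users of a table.
        return sum(1 for _, rel, t in self.edges if rel == "USES" and t == table)


@dataclass
class SearchIndex:
    """In-memory stand-in for the search service's index."""
    docs: dict[str, dict] = field(default_factory=dict)

    def index(self, name: str, text: str, popularity: int) -> None:
        self.docs[name] = {"terms": set(text.lower().split()),
                           "popularity": popularity}

    def search(self, query: str) -> list[str]:
        wanted = set(query.lower().split())
        scored = []
        for name, doc in self.docs.items():
            matches = len(wanted & doc["terms"])
            if matches:
                scored.append((matches, doc["popularity"], name))
        # Most matching terms first, then most popular, then name for stability.
        scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
        return [name for _, _, name in scored]


def ingest(records: list[TableRecord], graph: MetadataGraph,
           index: SearchIndex) -> None:
    """Write connections to the graph first, then derive search documents."""
    for rec in records:
        for user in rec.used_by:
            graph.add_usage(user, rec.name)
    for rec in records:
        text = f"{rec.name.replace('_', ' ')} {rec.description}"
        index.index(rec.name, text, graph.popularity(rec.name))
```

The ordering inside ingest mirrors the point made above: the connections are recorded in the graph first, and the popularity signal used by search is derived from them.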

Lyft chose Neo4j because it captures the shape of their data ecosystem, which is naturally expressed as a graph. Neo4j’s flexibility also helps the team iterate quickly on new features.

“When we have a new use case and a new piece of metadata to represent, we just have to create a new node and create that relationship,” said Tannis.
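The idea in that quote can be illustrated with a small property-graph sketch. This is plain Python rather than Neo4j, and the PropertyGraph class, the QualityCheck label and the HAS_CHECK relationship are all hypothetical names invented for the example. It shows only the general pattern: a new kind of metadata becomes a new node plus a relationship, with no change to what is already stored.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph (illustrative stand-in, not Neo4j)."""

    def __init__(self):
        self.nodes = {}                     # node id -> (label, properties)
        self.out_edges = defaultdict(list)  # node id -> [(rel_type, target id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_relationship(self, source, rel_type, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.out_edges[source].append((rel_type, target))

    def neighbors(self, source, rel_type):
        # .get avoids creating empty entries for nodes without outgoing edges.
        return [t for r, t in self.out_edges.get(source, []) if r == rel_type]


graph = PropertyGraph()

# Existing metadata: a table and a user who queries it.
graph.add_node("driver_rides_completed", "Table")
graph.add_node("ana", "User")
graph.add_relationship("ana", "USES", "driver_rides_completed")

# New use case (hypothetical): attach a data quality check to the table.
# Nothing already in the graph has to be altered or migrated.
graph.add_node("null_check", "QualityCheck", status="passing")
graph.add_relationship("driver_rides_completed", "HAS_CHECK", "null_check")

print(graph.neighbors("driver_rides_completed", "HAS_CHECK"))
```

Existing queries over USES relationships keep working unchanged, because the new node type sits alongside the old ones rather than altering them.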

At Lyft, Neo4j is an important component of Amundsen’s architecture; it serves as the source of truth for editable metadata. Neo4j also provides a foundation for new projects like compliance and data quality. “The future, as I see it, is that we’ve got a full-fledged metadata repository on which we’re building many applications,” said Mark Grover, a product manager at Lyft.
