
Using Mode definitions (and dbt) to bootstrap your data model

[Image: Penny model. Oh, if our network graphs were so simple.]

At Landed, we track a variety of customer journeys through a variety of lifecycle stages across a variety of products. To explore all this variety, our teams rely on a collection of analytics dashboards to help them anticipate the needs of our customers and partners. The questions our teams ask change over time, which requires a flexible and responsive practice for building and updating data models. We recently went through a process of upgrading our data infrastructure, and asked ourselves “how might we define an agile sprint that will take us from raw data to a BI Platform MVP in a couple of weeks?”

We knew we were looking for a data modeling solution that would provide version control, a staging environment with the ability to define tests, and a nice community of practice to support and learn from. After shopping around, we settled on two tools that work nicely together as a pair: Mode Analytics and Fishtown Analytics’ data build tool (dbt). Mode is a SQL-driven BI Platform that supports R and Python, which means you can augment its internal charting with all your favorite libraries (seaborn, matplotlib, etc.). dbt is a tool (and perhaps a state of mind?) that allows you to clean, denormalize, test, and pre-aggregate your data.

The goal of this post is to share the broad strokes of how we used these two tools together to get our BI Platform on its feet during a 4-week sprint.

Here’s the general approach we took:

  • STEP 1: Confirm definitions and calculations (prep)
  • STEP 2: Create base tables (dev)
  • STEP 3: Document and define tests in dbt .yaml files (staging)
  • STEP 4: Schedule builds using dbt cloud (prod)
  • STEP 5: Give back to the open-source community (:smile:)
  • Rinse and repeat! (iterate)

STEP 1: Confirm Definitions and Calculations

Following the sage advice of my database elders, I created a catalog of all of our properties, as well as a list of the queries that drive our existing reports and current OKR metrics. This documentation became my source of truth during the build: a simple shared document where teams could track progress in real time.

I was lucky to join Landed after a period of rigorous property definition and data cleaning, so our starting point for this process was a place of clarity and consensus. If you’re bootstrapping a model with less clarity, I encourage you to scale the project back until you arrive at the handful of properties and calculations everyone can agree upon (even if, at first, that’s just first_name and last_name). As you build your system out and have some success with your early charting projects (“Look, a list of all our customer names!”), this will inspire others to stabilize their logic and clean up their data so they can jump in the game.

STEP 2: Create Base Tables with Mode Definitions

Working from our list of queries, I then created base tables for each team. These tables included only the properties needed to drive the reporting: a curated collection of VIP features. I defined these base tables using Mode Definitions, which are select statements you can reference in other queries (up to 10 layers of nesting!). The base tables were defined iteratively; as I wrote each query in Mode, I added the properties I needed from our Redshift cluster.

These base tables ensure our users never query our raw data directly. Even though our read-only integration with Mode already guarantees that users can’t mutate the raw data, we still want a clean record of the data types and calculations serving our analytics, which these base tables provide. They are also where we define our aliases for each column: our ETL from HubSpot to Redshift (which we coordinate with Stitch) pulls in the internal names of our data properties, which differ from the field names our teams engage with. While most of our team members will not be exploring our SQL queries directly, it’s always a good idea to norm on terminology.
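Here’s a rough sketch of what one of these base-table definitions looks like (the HubSpot property names and schema below are hypothetical). The select renames internal columns to the terms our teams use; once saved as a Mode Definition, it can be referenced by name from other report queries:

-- Hypothetical base table, saved as a Mode Definition.
-- Internal HubSpot property names (left) are aliased to the
-- field names our teams actually engage with (right).
select
    vid                     as uid,
    property_firstname      as first_name,
    property_lastname       as last_name,
    property_lifecyclestage as lifecycle_stage
from hubspot.contacts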

STEP 3: Document Tables and Define Tests in dbt .yaml Files

This is the part of our journey where dbt comes in. To be honest, it wasn’t entirely clear to me at the outset when we should switch from defining our model in Mode Definitions to setting up our tests in dbt and building our materialized views. This is partly because it’s quite easy to use Mode Definitions in your Mode reports, and I noticed a bit of inertia when confronted with the hump of transferring things over to dbt. When I remembered dbt’s test coverage tooling, I rallied!

We arrived at an internal rule of thumb: start shifting things over from Mode once we’d covered 90% of the queries in our initial plan, which we hit about three weeks in. This part of the process involved defining a .yaml file for each base table that included pithy descriptions for all properties, as well as test definitions, of which there are four flavors in dbt: not_null, unique, relationships, and accepted_values.

Here’s an example:

version: 2

models:
  - name: customers
    description: Customer base table
    columns:
      - name: uid
        tests:
          - not_null
          - unique
          
      - name: best_friend
        description: Customer's BFF
        tests:
          - relationships:
              to: ref('friends')
              field: uid
            
      - name: favorite_ice_cream_flavors
        description: Customer's favorite ice cream flavors
        tests:
          - not_null
          - accepted_values:
              values: ['cherry', 'chocolate']
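
Once these .yaml files are in place, dbt test turns each definition into a query that must return zero rows. As a rough sketch (assuming the customers model above lands in an analytics schema), the not_null test on uid compiles to something like:

-- Approximate SQL dbt generates for the not_null test on uid;
-- the test fails if this query returns any rows.
select uid
from analytics.customers
where uid is null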

It’s worth noting that Mode supports persistent derived tables (pdt), which replicate the materialized views functionality provided by dbt. I was drawn to dbt’s active open source community and testing methods, but if you’re already using Mode Analytics for your BI, you may want to look into pdt.

STEP 4: Schedule Builds Using dbt Cloud

This is the stage we’re at now at Landed: scheduling our nightly dbt builds to run after our ETL integration. Since we integrate data from our CRM every night, we schedule our dbt Cloud job to materialize our base table views directly afterward. This part is pretty straightforward, because by this point all the hard work of building and testing the model is done.
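As a minimal sketch of what the nightly job builds (carrying forward the hypothetical base table from STEP 2), a dbt model is just that same select saved as a .sql file with a materialization config; the dbt Cloud job then runs dbt run (and dbt test) on a nightly cron:

-- models/customers.sql (hypothetical): the base-table select from STEP 2,
-- now living in dbt and rebuilt as a view on every nightly run.
{{ config(materialized='view') }}

select
    vid                     as uid,
    property_firstname      as first_name,
    property_lastname       as last_name,
    property_lifecyclestage as lifecycle_stage
from hubspot.contacts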

One thing to note: if you’re using Redshift, you’ll probably want to use late-binding views so that you don’t hit any snags in your nightly build. Late-binding views resolve the underlying tables at query time, so you can drop and rebuild a table that a view binds to without throwing any errors. I hit this snag when we set up our ETL with Stitch; after switching to late-binding views, everything runs fine.
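In dbt this is a one-line change. Here’s a sketch using the dbt-redshift adapter’s bind config, which adds Redshift’s WITH NO SCHEMA BINDING clause to the generated view:

-- Hypothetical model config: bind=False tells dbt to create the view
-- as a late-binding view (WITH NO SCHEMA BINDING) on Redshift.
{{ config(materialized='view', bind=False) }}

select * from hubspot.contacts  -- Stitch can drop and rebuild this table without breaking the view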

STEP 5: Give Back to the Open-Source Community

Whether picking off an issue for dbt or writing up a blog post about what you built, it’s always nice to round out a build cycle by giving some love back to the community. Learning with others is one of the greatest joys of engineering, as is supporting projects that advance best practices in the field. Find a ticket, make some friends through the repo, and submit a PR!

Rinse and Repeat!

That’s it! Now comes endless cycles of iteration :)

Siobhán is the Data Engineering Lead at Landed. Landed is on a mission to help essential professionals build financial security near the communities they serve, and we invest alongside educators when they’re ready to buy homes in expensive markets. Sound awesome? Join us!

