A lightweight alternative to Amundsen for your dbt project

If you've been using dbt for a little while, chances are your project has more than 50 models. Chances are more than 10 people are building dashboards based on those models.

In the best case, self-service analytics users come to you with repeating questions about which model to use when. In the worst case, they are making business decisions using the wrong model.

In this post, I will show you how you can build a lightweight metadata search engine on top of your dbt metadata to answer all these questions. I hope to show you that data governance, data lineage, and data discovery don't need to be complicated topics and that you can get started today on those roadmaps with my lightweight open source solution.

LIVE DEMO: https://dbt-metadata-utils.guitton.co

Data Governance is Ripe

In his recent post The modern data stack: past, present, and future, Tristan Handy - the CEO of Fishtown Analytics (the company behind dbt) - wrote:

Governance is a product area whose time has come. This product category encompasses a broad range of use cases, including discovery of data assets, viewing lineage information, and just generally providing data consumers with the context needed to navigate the sprawling data footprints inside of data-forward organizations. This problem has only been made more painful by the modern data stack to-date, since it has become increasingly easy to ingest, model, and analyze more data.

He later also points out that dbt has its own lightweight governance interface: dbt Docs. It is a great starting point and might be enough for a while. However, as time goes by, your dbt project will outgrow its clothes. The search in dbt Docs is regex-only, and you might find its relevancy going down as the number of models grows. This matters for Data Analysts building dashboards and looking for the right model, but also for Data Engineers looking to "pull the thread" when debugging a model. Those use cases can be summarised in the two following "Jobs to be Done":

Jobs to be Done
Data discovery can solve 2 'Jobs to be Done'
  1. When I want to build a dashboard, but I don’t know which table to use, help me search through the available models, so I can be confident in my conclusions.
  2. When I am debugging a data model, but I don’t know where to start, help me get data engineering context, so I can be faster to a solution.

These days, the solution to those two problems seems to be rolling out "heavyweight" tools like Amundsen. As Paco Nathan writes on p. 115 of the book Data Teams by Jesse Anderson (you can find my review of the book here):

If you look across Uber, Lyft, Netflix, LinkedIn, Stitch Fix, and other firms roughly in that level of maturity, they each have an open source project regarding a knowledge graph of metadata about dataset usage -- Amundsen, Data Hub, Marquez and so on. [...] Once an organization began to leverage those knowledge graphs, they gained much more than just lineage information. They began to recognize the business process pathways from data collection through data management and into revenue bearing use cases.

Amundsen logo

Amundsen and other heavyweight tools are the go-to solution for data discovery

Those tools come on top of an already complex stack of tools that data teams need to operate. What if we wanted a lightweight solution instead, like dbt Docs?

More on metadata engines

Other products:

  • Collibra
  • Alation
  • Intermix
  • data.world
  • montecarlodata.com

More on self-service analytics

I see the word Data Democratization mentioned more and more often. I find the most tangible application is the one derived by the GitLab data team -- Stephan Durry, Head of Data & Insights at OneFootball

Self-Service Data | GitLab

The Features of Amundsen and other Metadata Engines

In his great Teardown of Data Discovery Platforms, Eugene Yan summarises the features of Amundsen and other metadata engines really well. He splits them into three categories: features to find data, features to understand data, and features to use data.

Amundsen logo

Architecture of your friendly neighbourhood metadata engine

Its friendly UI with a familiar search UX is one of the key factors behind Amundsen's success. Another one is its modular architecture, which is already being reused by other open source metadata projects like whale (previously called metaframe).

We can further split the three categories of features into 10 features of varying implementation difficulty. Those features also have varying returns, which are not represented here.

Amundsen features

Taxonomy of 10 features from metadata engines, cost opinions are my own

The key thing to realise is that Lyft might have spent a 15⭐️ cost to assemble all those features in Amundsen. But what if we wanted to build a 3⭐️-cost metadata engine instead? Which features and technologies would you pick?

2021-02-09 Update: The Ground paper from Rise labs

In the seminal paper Ground: A Data Context Service, the RISE Lab outlined those features with much better terminology that I wasn't aware of when I first wrote this post: the ABCs of Metadata

  • Application - information on how the data should be interpreted
  • Behavior - information on how the data is created and used
  • Change - information on the frequency and types of updates to the data

A Lightweight Alternative to Amundsen

Although it's possible that feature completeness (everything is in one place) is the USP of Amundsen and others, I want to make the case for a more lightweight approach.

Documentation tools go stale easily, at least when they are not tied to the data modeling code. dbt has proven with dbt Docs that data people want to document their code (hi team 😁). We were just waiting for a tool simple and integrated enough for the culture of Data Governance to blossom. It reminds me of those DevOps books showing that the solution is not the tooling but rather the culture (if you're curious, check out The Phoenix Project).

Additionally, dbt sources are a great way to label raw data explicitly. The dbt graph documents data lineage for you at the table level, and I will later leverage that graph to propagate tags with no additional work.

In other words, with schemas, descriptions and data lineage, dbt Docs covers the category

Features to Understand

from the above diagram. So what is missing from dbt Docs to rival Amundsen? Only a way to surface the work that is already happening in your dbt repository. And that is Search.

Algolia logo

Algolia market themselves as a 'flexible search platform'

A good search engine will cover the Features to Find category. Fortunately, we don't need to build a search engine ourselves. This is where we will use Algolia's free tier, in addition to some static HTML and JS files, to build our lightweight data discovery and metadata engine. Algolia's free tier gives you 10k search requests and 10k records per month. Given that for us 1 record = 1 dbt model and 1 search request = 1 data request from a user, my guess is that the free tier will cover our needs for a while.
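To make this concrete, here is a minimal indexing sketch with the algoliasearch Python client; the index name matches the .env file shown later, and the record fields are illustrative rather than the exact schema used by dbt-metadata-utils:

# pip install algoliasearch
from algoliasearch.search_client import SearchClient

# Credentials come from the Algolia dashboard (same values as the .env file below).
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("jaffle_shop_nodes")

# One record per dbt model keeps most projects well within the 10k-records free tier;
# the fields here are illustrative.
index.save_objects(
    [
        {
            "objectID": "model.jaffle_shop.customers",  # stable ID so re-indexing upserts
            "name": "customers",
            "description": "One customer per row, with derived order metrics.",
            "tags": ["mart"],
        }
    ]
)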

Note: if you're worried that Algolia isn't open source, consider using the project typesense.

How do we get at least one feature in the Features to Use category? Well, a dbt project is tracked in version control, so by parsing git's metadata we can, for example, know each model's owner.
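As a sketch of that idea, here is one possible heuristic, assuming GitPython is installed: treat the most frequent committer of a model's file as its "owner". The function and paths below are illustrative, not the actual dbt-metadata-utils code:

# pip install GitPython
import os
from collections import Counter

from git import Repo

def guess_owner(repo_path: str, model_path: str) -> str:
    """One possible notion of 'owner': the most frequent committer of a file."""
    repo = Repo(os.path.expanduser(repo_path))
    authors = Counter(
        commit.author.name for commit in repo.iter_commits(paths=model_path)
    )
    return authors.most_common(1)[0][0] if authors else "unknown"

print(guess_owner("~/workspace/jaffle_shop", "models/customers.sql"))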

More generally, to extend our lightweight metadata engine, we would add metadata sources and develop parsers to collect and organise that metadata. We would then index that metadata in our search engine. Examples of metadata sources are the dbt manifest, the git history of the repository, and the warehouse query logs.
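The dbt manifest is the main source in this project. As a rough sketch of such a parser, assuming the layout of target/manifest.json at the time of writing (field names can change across dbt versions), each model becomes a flat document ready for indexing:

import json
from pathlib import Path

# target/manifest.json is produced by `dbt docs generate` (or `dbt compile`).
manifest = json.loads(Path("target/manifest.json").read_text())

# Keep only models (not tests, seeds, ...) and flatten them into search documents.
documents = [
    {
        "objectID": unique_id,
        "name": node["name"],
        "description": node.get("description", ""),
        "tags": node.get("tags", []),
    }
    for unique_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
]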

What does good Search look like

Search is going to be key if our metadata engine is to rival Amundsen, so let's look at Amundsen's docs. We know from their architecture page that they use ElasticSearch under the hood. And we can also read that we will need a ranking mechanism to order our dbt models by relevancy:

Search for data within your organization by a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard. -- Source

A bit further in the docs, we learn that Amundsen has three search indices and that the search bar uses multi-index search against those indices:

the users could search for any random information in the search bar. In the backend, the search system will use the same query term from users and search across three different entities (tables, people, and dashboards) and return the results with the highest ranking. -- Source

We even get examples for searchable attributes for the documents in the tables index:

For Table search, it will search across different fields, including table name, schema name, table or column descriptions, tags and etc -- Source

Presumably, there's not much point in reverse engineering an open source project, so I'll spare you the rest: it also supports search-as-you-type and faceted search (applying filters).

More on search

To build this search capability, you could use different technologies. I attended a talk at Europython 2020 from Paolo Melchiorre advocating for good old PostgreSQL full-text search. To my knowledge though, you don't get search-as-you-type with it. This is one of the reasons why people tend to go for ElasticSearch or Algolia. Choosing between them is then a build-or-buy decision: more engineering resources vs "throwing money" at the serverless Algolia. As we saw though, for our use case the free tier will be enough, so we get the best of both worlds.

That leaves the question of how to structure our documents for search. Attributes in searchable documents are one of three types: a searchable attribute (i.e. it matches your query), a faceting attribute (i.e. a filter), or a ranking attribute (i.e. a weight).

Structuring Documents for Search
Keys in searchable documents are 1 of 3 types

Our searchable attributes will be table names and descriptions.

Our faceting attributes will be "tags" on our models: these could be vanilla dbt tags if you have good ones, or the materialisation, the resource type, or any other key from the .yml file. Assuming the code maintainers make a conscious curation effort when they place a model in a folder of the dbt codebase, we can use folder names as a faceting attribute too. Lastly, we can use the dbt graph to propagate, from left to right, the sources that models depend on; this will serve as another useful faceting attribute.
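Here is a rough sketch of that propagation, assuming networkx and the manifest parsed earlier: the manifest's parent_map gives the edges of the DAG, and nx.ancestors walks it upstream (the model id is only for illustration):

import networkx as nx

# Build the dbt DAG from the manifest: one edge per parent -> child dependency.
graph = nx.DiGraph()
for child, parents in manifest["parent_map"].items():
    for parent in parents:
        graph.add_edge(parent, child)

def upstream_sources(node_id: str) -> list:
    """All sources a model ultimately depends on, found by walking the DAG upstream."""
    return sorted(n for n in nx.ancestors(graph, node_id) if n.startswith("source."))

# The resulting list becomes a faceting attribute on the model's search document.
print(upstream_sources("model.jaffle_shop.customers"))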

For ranking attributes, we will build metrics that matter to us to prioritise tables for our users. Keep in mind that we started with 2 use cases ('Jobs to be Done'), so each persona could benefit from a different metric. For example, for "dashboard builders", the goal could be to downrank the corner-case models so that only "central" models are used. But for "data auditors", the goal might be to prioritise the models that need attention first. In our case, we will focus on the first persona and use a PageRank-like algorithm (degree centrality, as shown in my previous post). This is great at the start of your self-service analytics journey: dashboard builders might not know yet which tables are good, so a good proxy is to look at which models are reused by your dbt committers. Later, you could do like Amundsen and rely on query logs to boost the models that are used the most.
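Putting the three attribute types together, this is roughly what the Algolia index settings could look like; attribute names are illustrative, and index and graph come from the earlier sketches:

# Degree centrality from the dbt graph, stored on each document as a ranking attribute.
centrality = nx.degree_centrality(graph)

# Map the three attribute types onto Algolia index settings (names are illustrative).
index.set_settings(
    {
        # searchable attributes: what the query text is matched against
        "searchableAttributes": ["name", "description"],
        # faceting attributes: the filters shown next to the search box
        "attributesForFaceting": ["tags", "folder", "sources", "loader"],
        # ranking attributes: break ties between equally good matches
        "customRanking": ["desc(degree_centrality)", "desc(is_mart)"],
    }
)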

Putting it together in the dbt-metadata-utils repository

I have assembled a couple of scripts in the (work in progress) repository called dbt-metadata-utils. I will walk through a couple of key parts here, but feel free to check out the full code there, and if you want to use it on your own project, hit me up.

All you will need to do is:

  • have your existing dbt project in a git repository locally
  • clone dbt-metadata-utils on the same machine as your dbt project
  • create an Algolia account (and API key)
  • create an Algolia app inside that account
  • run the commands laid out below

For the dbt project, we will use one of the example projects listed in the dbt docs: the jaffle_shop codebase. If you haven't already, run dbt docs generate (or dbt compile) inside that project so that target/manifest.json exists.

I had no clue about Jaffles, and then I used dbt

Create an environment file in which you will need to fill in the values from the Algolia dashboard:

# .env file

# Algolia credentials from your Algolia dashboard
ALGOLIA_ADMIN_API_KEY=
ALGOLIA_SEARCH_ONLY_API_KEY=
ALGOLIA_APP_ID=

# name of the Algolia index to create and update
ALGOLIA_INDEX_NAME=jaffle_shop_nodes

# local paths to your dbt project and its compiled manifest
DBT_REPO_LOCAL_PATH=~/workspace/jaffle_shop
DBT_MANIFEST_PATH=~/workspace/jaffle_shop/target/manifest.json

# where parsed git metadata gets cached
GIT_METADATA_CACHE_PATH=data/git_metadata

And then run the 4 make commands:

$ make install  # best is to install inside a virtual environment
pip install --upgrade pip
pip install -r requirements.txt

$ make update-git-metadata
python -m dbt_metadata_utils.git_metadata
100%|███████████████████████████████████████████| 11/11 [00:00<00:00, 12499.96it/s]

$ make update-index
python -m dbt_metadata_utils.algolia

$ make run
cd dbt-search-app && npm start

> dbt-search-app@1.0.0 start /Users/louis.guitton/workspace/dbt-metadata-utils/dbt-search-app
> parcel index.html --port 3000

Server running at http://localhost:3000
✨  Built in 1.03s.

If you navigate to http://localhost:3000, you should see a UI that looks like this:

Search UI on jaffle_shop

I didn't dwell on the details, but our metadata engine's features are:

  • search-as-you-type on table names, table descriptions, the model's folder in the dbt codebase, or its sources
  • tag propagation through the DAG, using the loader and sources keys from the dbt .yml files
  • faceted search on those tags
  • ranking by degree centrality, boosting dbt models that live in a mart or have a docs description
  • enrichment of the table documents with git metadata parsed from the git repository using the Python git client
  • advanced search using dynamic filtering: if you enter a query containing a loader (e.g. "airflow payments"), it will use Rules to filter documents with loader=airflow
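For that last feature, here is a minimal sketch of what such a Rule could look like with the Algolia Python client; the rule id and consequence are illustrative, not the actual rules shipped with dbt-metadata-utils:

# When a query contains "airflow", filter on loader=airflow and drop the word from the query.
index.save_rule(
    {
        "objectID": "filter-airflow-loader",  # illustrative rule id
        "conditions": [{"pattern": "airflow", "anchoring": "contains"}],
        "consequence": {
            "params": {
                "filters": "loader:airflow",
                "query": {"edits": [{"type": "remove", "delete": "airflow"}]},
            }
        },
    }
)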

Conclusion

LIVE DEMO: https://dbt-metadata-utils.guitton.co

There you have it! A lightweight data governance tool on top of dbt artifacts and Algolia. I hope this showed you that data governance doesn't need to be a complicated topic, and that by using a knowledge graph of metadata, you can get a head start on your roadmap.

Leave a star on the GitHub project, and let me know your thoughts on Twitter. I enjoyed building this project and writing this post because it lies at the intersection of three of my areas of interest: NLP, Analytics, and Engineering. I cover those three topics in other places on my blog.

Resources

  1. The modern data stack: past, present, and future | dbt blog
  2. A Jobs to be Done Framework for Startups — JTBD Templates & Examples for Building Products Customers Want | First Round Review
  3. Louis Guitton’s review of Data Teams: A Unified Management Model for Successful Data-Focused Teams | Goodreads
  4. Teardown: What You Need To Know on Data Discovery Platforms
  5. Architecture - Amundsen
  6. dataframehq/whale: 🐳 The stupidly simple data discovery tool.
  7. How to find and organize your data from the command-line | by Robert Yi | Towards Data Science
  8. The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win by Gene Kim | Goodreads
  9. Site Search & Discovery powered by AI | Algolia
  10. typesense/typesense: Fast, typo tolerant, fuzzy search engine for building delightful search experiences ⚡ 🔍
  11. louisguitton/dbt-metadata-utils: Parse dbt artifacts and search dbt models with Algolia
  12. transform/snowflake-dbt · master · GitLab Data / GitLab Data Team · GitLab