How to parse dbt artifacts

Dec 20, 2020

#Data Management
#Analytics

Read time: 5 minutes

Overview🔗

If you're using dbt, chances are you've noticed that it generates and saves one or more artifacts with every invocation.

In this post, I'll show you how to get started with dbt artifacts, and how to parse them to unlock applications valuable to your team and your use case.

Whether that's just for a fun Friday afternoon learning session, or whether that's your first foray at building a Data Governance tool using dbt, I hope you'll find this post useful, and if you do, let me know on twitter!

When are Artifacts Produced🔗

dbt logo

A word of warning: dbt's current minor version as of writing is v0.18.1 and multiple improvements to artifacts are coming in dbt's next version v0.19.0, but that doesn't change the content of this post.

dbt has produced artifacts since the release of dbt-docs in v0.11.0. Starting in dbt v0.19.0, we are committing to a stable and sustainable way of versioning, documenting, and validating dbt artifacts.
Ref: https://next.docs.getdbt.com/reference/artifacts/dbt-artifacts/

The artifacts currently generated are JSON files called manifest.json, catalog.json, run_results.json and sources.json. They are used to power the docs website and other dbt features.

Different dbt commands generate different artifacts, so I've summarised that in the table below:

	compile	run	test	snapshot	seed	docs	source	list
manifest	x	x	x			x		x
run results		x	x	x	x	x
catalog						x
sources							x

Of course, dbt docs is the command that refreshes most artifacts (makes sense, since they were initially introduced to power the docs site). But it's interesting to note that manifest can be refreshed by other commands than the usual suspects dbt run or dbt test too.

Available Data in dbt artifacts🔗

Manifest:

Today, dbt uses this file to populate the docs site, and to perform state comparison. Members of the community have used this file to run checks on how many models have descriptions and tests.

Run Results:

In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.

Catalog:

Today, dbt uses this file to populate metadata, such as column types and table statistics, in the docs site.

Sources:

Today, dbt Cloud uses this file to power its Source Freshness visualization.

graph.gpickle:

Stores the networkx representation of the dbt resource DAG.

Parsing Artifacts from the Command Line with `jq`🔗

jq logo

To get started with parsing dbt artifacts for your own use case, I suggest to use jq, the lightweight and flexible command-line JSON processor. This way, you can try out your ideas, explore the available data without writing much code at first.

jq Cheat sheet:

In particular, you will need to make use of some of the built-in operators like to_entries and map.

Here is a command to grab the materialisation of each model

→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})'
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]

You can then for example store that into a file by piping the output

cat target/manifest.json | jq '.nodes | ...' > my_data_of_interest.json

Parsing Artifacts from Python with `pydantic`🔗

Once you get a better idea of what data you need, you might want to develop more custom logic around dbt artifacts. This is where python shines: you can write a script with the logic you need. You can install and import great python libraries. For instance, you could use networkx to run graph algorithms on your dbt DAG.

You will then need to parse the dbt artifacts in python. I recommend to use the great pydantic library: among other things, it allows to parse JSON files with very concise code that lets you focus on high-level parsing logic.

Here is an example logic to parse manifest.json:

import json
from typing import Dict, List, Optional
from enum import Enum

from pydantic import BaseModel, validator


class DbtResourceType(str, Enum):
    model = 'model'
    analysis = 'analysis'
    test = 'test'
    operation = 'operation'
    seed = 'seed'
    source = 'source'


class DbtMaterializationType(str, Enum):
    table = 'table'
    view = 'view'
    incremental = 'incremental'
    ephemeral = 'ephemeral'
    seed = 'seed'


class NodeDeps(BaseModel):
    nodes: List[str]


class NodeConfig(BaseModel):
    materialized: Optional[DbtMaterializationType]


class Node(BaseModel):
    unique_id: str
    path: Path
    resource_type: DbtResourceType
    description: str
    depends_on: Optional[NodeDeps]
    config: NodeConfig


class Manifest(BaseModel):
    nodes: Dict["str", Node]
    sources: Dict["str", Node]

    @validator('nodes', 'sources')
    def filter(cls, val):
        return {k: v for k, v in val.items() if v.resource_type.value in ('model', 'seed', 'source')}


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)

    m = Manifest(**data)

Once you've got the Manifest class, you can use it in your custom logic. For example, in our use case from above where we want to check for model materialization, we can do:

>>> m = Manifest(**data)
>>> [{"node": node, "materialized": n.config.materialized.value} for node, n in m.nodes.items()]
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]

Example Application 1: Detecting a Change in Materialization🔗

Let's say you want to check that no materialisation has changed before you run dbt run. This is useful because some materialization changes require a --full-refresh. You could achieve the change detection with the following commands:

→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > old_state.json
→ # code change: let's say one model materialization is changed from table to view
→ dbt compile
→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > new_state.json
→ diff old_state.json new_state.json
12c12
<     "materialized": "table"
---
>     "materialized": "view"

Example Application 2: Compute Model Centrality with `networkx`🔗

Once you've parsed the manifest.json, you have at your disposal the graph of models from your project. You could explore off-the-shelf graph algorithms provided by networkx, and see if any of the insights you get are valuable.

For example, nx.degree_centrality can give you the list of models that are "central" to your project. You can use that e.g. to priotise maintenance efforts. In the future, you could imagine a dbt docs search that prioritises results based on this metric as a very simple PageRank proxy.

Once you've written the pydantic code from above, this turns out to be possible in a very small amount of lines.

import networkx as nx

# ... pydantic code from above for Manifest class

class GraphManifest(Manifest):
    @property
    def node_list(self):
        return list(self.nodes.keys()) + list(self.sources.keys())

    @property
    def edge_list(self):
        return [(k, d) for k, v in self.nodes.items() for d in v.depends_on.nodes]

    def build_graph(self) -> nx.Graph:
        G = nx.Graph()
        G.add_nodes_from(self.node_list)
        G.add_edges_from(self.edge_list)
        return G


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)

    m = GraphManifest(**data)
    G = m.build_graph()
    nx.degree_centrality(G)

Example Application 3: Graph visualisation🔗

Provided you use python 3.8+, there is another dbt artifact that can be interesting to you: graph.gpickle. Instead of parsing manifest.json and building the graph yourself, you can deserialize the networkx graph built by dbt itself.

All it takes is 2 lines!

That's hard to beat, but note that you will rely on the internal graph definition of dbt and won't be able to customise it. For example, tests will be nodes on your graph now.

import networkx as nx

G = nx.read_gpickle("target/graph.gpickle")

Nevertheless, this can be useful for example for a quick visulisation using pyvis:

from pyvis.network import Network

nt = Network("500px", "1000px", notebook=True)
nt.from_nx(G)
nt.show("nx.html")