# How to parse dbt artifacts

Tags: #Data
## Overview

If you're using dbt, chances are you've noticed that it generates and saves one or more artifacts with every invocation. In this post, I'll show you how to get started with dbt artifacts and how to parse them to unlock applications valuable to your team and your use case. Whether that's just for a fun Friday afternoon learning session or your first foray into building a Data Governance tool on top of dbt, I hope you'll find this post useful. If you do, let me know on Twitter!
## When are Artifacts Produced

A word of warning: dbt's current minor version as of writing is v0.18.1, and multiple improvements to artifacts are coming in the next version, v0.19.0, but that doesn't change the content of this post.

> dbt has produced artifacts since the release of dbt-docs in v0.11.0. Starting in dbt v0.19.0, we are committing to a stable and sustainable way of versioning, documenting, and validating dbt artifacts.

Ref: https://next.docs.getdbt.com/reference/artifacts/dbt-artifacts/
The artifacts currently generated are JSON files called `manifest.json`, `catalog.json`, `run_results.json` and `sources.json`, all written to the `target/` directory of your project. They are used to power the docs website and other dbt features.
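Since these are plain JSON files in `target/`, the quickest way to see what's inside is to load one and inspect its top-level keys. A minimal sketch, assuming you've run a command that produces `manifest.json` (e.g. `dbt compile`) from your project root:

```python
import json

# Load the manifest written to target/ by the last dbt invocation.
with open("target/manifest.json") as fh:
    manifest = json.load(fh)

# Top-level sections include, among others, "nodes" and "sources".
print(sorted(manifest.keys()))
```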
Different dbt commands generate different artifacts, so I've summarised that in the table below:
| | debug | init | compile | run | test | deps | snapshot | clean | seed | docs | source | run-operation | rpc | list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| manifest | | | x | x | x | | | | | x | | | | x |
| run results | | | | x | x | | x | | x | x | | | | |
| catalog | | | | | | | | | | x | | | | |
| sources | | | | | | | | | | | x | | | |
Of course, `dbt docs` is the command that refreshes the most artifacts (which makes sense, since they were initially introduced to power the docs site). But it's interesting to note that `manifest` can also be refreshed by commands other than the usual suspects `dbt run` or `dbt test`.
## Available Data in dbt Artifacts

Manifest: Today, dbt uses this file to populate the docs site and to perform state comparison. Members of the community have used this file to run checks on how many models have descriptions and tests (see the sketch after this list).

Run Results: In aggregate, many run_results.json files can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.

Catalog: Today, dbt uses this file to populate metadata, such as column types and table statistics, in the docs site.

Sources: Today, dbt Cloud uses this file to power its Source Freshness visualisation.

graph.gpickle: Stores the networkx representation of the dbt resource DAG.
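As a taste of what's possible, the documentation-coverage check mentioned under Manifest boils down to a few lines of Python. A minimal sketch, assuming a freshly generated `target/manifest.json`:

```python
import json

with open("target/manifest.json") as fh:
    manifest = json.load(fh)

# "nodes" maps unique ids like "model.jaffle_shop.stg_customers"
# to node metadata; keep only the models.
models = {
    k: v for k, v in manifest["nodes"].items()
    if v["resource_type"] == "model"
}
documented = [k for k, v in models.items() if v.get("description")]
print(f"{len(documented)}/{len(models)} models have a description")
```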
## Parsing Artifacts from the Command Line with `jq`

To get started with parsing dbt artifacts for your own use case, I suggest using `jq`, the lightweight and flexible command-line JSON processor. This way, you can try out your ideas and explore the available data without writing much code at first.

`jq` cheat sheet: in particular, you will need to make use of some of the built-in operators like `to_entries` and `map`.
Here is a command to grab the materialisation of each model:

```bash
$ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})'
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]
```
You can then, for example, store that in a file by piping the output:

```bash
$ cat target/manifest.json | jq '.nodes | ...' > my_data_of_interest.json
```
## Parsing Artifacts from Python with `pydantic`

Once you get a better idea of what data you need, you might want to develop more custom logic around dbt artifacts. This is where Python shines: you can write a script with exactly the logic you need, and you can install and import great Python libraries. For instance, you could use `networkx` to run graph algorithms on your dbt DAG.

You will then need to parse the dbt artifacts in Python. I recommend using the great pydantic library: among other things, it lets you parse JSON files with very concise code, so you can focus on the high-level parsing logic.

Here is example logic to parse `manifest.json`:
```python
import json
from pathlib import Path
from typing import Dict, List, Optional
from enum import Enum

from pydantic import BaseModel, validator


class DbtResourceType(str, Enum):
    model = 'model'
    analysis = 'analysis'
    test = 'test'
    operation = 'operation'
    seed = 'seed'
    source = 'source'


class DbtMaterializationType(str, Enum):
    table = 'table'
    view = 'view'
    incremental = 'incremental'
    ephemeral = 'ephemeral'
    seed = 'seed'


class NodeDeps(BaseModel):
    nodes: List[str]


class NodeConfig(BaseModel):
    materialized: Optional[DbtMaterializationType]


class Node(BaseModel):
    unique_id: str
    path: Path
    resource_type: DbtResourceType
    description: str
    depends_on: Optional[NodeDeps]
    config: NodeConfig


class Manifest(BaseModel):
    nodes: Dict[str, Node]
    sources: Dict[str, Node]

    @validator('nodes', 'sources')
    def filter(cls, val):
        # Keep only the resource types we care about.
        return {k: v for k, v in val.items() if v.resource_type.value in ('model', 'seed', 'source')}


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)
    m = Manifest(**data)
```
Once you've got the `Manifest` class, you can use it in your custom logic. For example, in our use case from above, where we want to check model materialisations, we can do:
```python
>>> m = Manifest(**data)
>>> [{"node": node, "materialized": n.config.materialized.value} for node, n in m.nodes.items()]
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]
```
## Example Application 1: Detecting a Change in Materialisation

Let's say you want to check that no materialisation has changed before you run `dbt run`. This is useful because some materialisation changes require a `--full-refresh`. You could achieve the change detection with the following commands:
```bash
$ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > old_state.json
$ # code change: let's say one model materialisation is changed from table to view
$ dbt compile
$ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > new_state.json
$ diff old_state.json new_state.json
12c12
<     "materialized": "table"
---
>     "materialized": "view"
```
## Example Application 2: Compute Model Centrality with networkx

Once you've parsed `manifest.json`, you have at your disposal the graph of models from your project. You can explore the off-the-shelf graph algorithms provided by `networkx` and see if any of the insights you get are valuable.

For example, `nx.degree_centrality` can give you the list of models that are "central" to your project. You can use that, e.g., to prioritise maintenance efforts. In the future, you could imagine a `dbt docs` search that prioritises results based on this metric, as a very simple PageRank proxy.

Once you've written the pydantic code from above, this turns out to be possible in very few lines:
```python
import networkx as nx

# ... pydantic code from above for the Manifest class


class GraphManifest(Manifest):
    @property
    def node_list(self):
        return list(self.nodes.keys()) + list(self.sources.keys())

    @property
    def edge_list(self):
        return [(k, d) for k, v in self.nodes.items() for d in v.depends_on.nodes]

    def build_graph(self) -> nx.Graph:
        G = nx.Graph()
        G.add_nodes_from(self.node_list)
        G.add_edges_from(self.edge_list)
        return G


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)
    m = GraphManifest(**data)
    G = m.build_graph()
    nx.degree_centrality(G)
```
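To turn the raw centrality scores into a prioritised shortlist, you can sort them; a minimal sketch (the top-10 cutoff is an arbitrary choice):

```python
# Rank nodes by degree centrality: the most connected models are good
# candidates for prioritised maintenance.
centrality = nx.degree_centrality(G)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in top:
    print(f"{score:.3f}  {node}")
```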
## Example Application 3: Graph Visualisation

Provided you use Python 3.8+, there is another dbt artifact that can be interesting to you: `graph.gpickle`. Instead of parsing `manifest.json` and building the graph yourself, you can deserialise the networkx graph built by dbt itself. All it takes is two lines! That's hard to beat, but note that you will be relying on dbt's internal graph definition and won't be able to customise it. For example, tests will now be nodes on your graph.
```python
import networkx as nx

G = nx.read_gpickle("target/graph.gpickle")
```
Nevertheless, this can be useful, for example, for a quick visualisation using pyvis:

```python
from pyvis.network import Network

nt = Network("500px", "1000px", notebook=True)
nt.from_nx(G)
nt.show("nx.html")
```
## References
- dbt (data build tool) - Transform data in your warehouse
- dbt Artifacts | docs.getdbt.com
- dbt Artifacts | next.docs.getdbt.com
- dbt Command reference | docs.getdbt.com
- jq, a lightweight and flexible command-line JSON processor.
- jq Manual | Builtin operators and functions
- pydantic docs
- NetworkX β NetworkX documentation
- networkx.algorithms.centrality.degree_centrality β NetworkX 2.5 documentation
- Interactive network visualizations β pyvis 0.1.3.1 documentation