Functional Programming for Data Engineering Pipelines that use Python Pandas dataframes
If you're maintaining a codebase that uses pandas dataframes heavily, you might have felt this pain already: your files are getting longer, and debugging the data transformations is getting slower.
When it comes to Data Engineering, Functional Programming has already proven its value, and I won't revisit that argument in this post. If you're not convinced, have a look at the seminal piece by Maxime Beauchemin (creator of Apache Airflow and Apache Superset), Functional Data Engineering — a modern paradigm for batch data processing.
But of all the tools used in Data Engineering and Machine Learning Operations, one is both very widely used and hard to write in a functional style: pandas dataframes. I will show some lesser-known ways to write pandas code that have served me well in previous roles and with previous clients to reduce tech debt and make Data Engineering in pandas more fun.
Functional Programming in Python
For an in-depth look, read Functional Programming in Python: When and How to Use It. As a quick taste, here is a lambda passed as the sort key to sorted:
>>> animals = ["ferret", "vole", "dog", "gecko"]
>>> sorted(animals, key=lambda s: -len(s))
['ferret', 'gecko', 'vole', 'dog']
Functional Programming in Pandas
For an intro to the topic, read Method chaining across multiple lines in Python.
Let's use this dataframe as an example:
import pandas as pd
df = pd.DataFrame.from_records([
    {"name": "Alice", "age": 24, "state": "NY", "point": 64},
    {"name": "Bob", "age": 42, "state": "CA", "point": 92},
    {"name": "Charlie", "age": 18, "state": "CA", "point": 70},
])
Bad: entry-level pandas
df["point_ratio"] = df['point'] / 100
df["surrogate_key"] = df["name"] + "-" + df["age"].astype(str) + "-" + df["state"]
df = df.drop(columns='state')
df = df.sort_values('age')
df = df.head(3)
While this keeps one transformation per line, df is mentioned everywhere.
We are not explicit about the fact that we rely on the transformations to happen in the order we wrote them.
Also, the surrogate_key transformation shows that readability drops as the transformation gets more complex.
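To make the hidden order dependency concrete, here is a minimal sketch that reuses the example dataframe and simply moves the drop statement above the surrogate_key statement. The try/except and the error variable are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {"name": "Alice", "age": 24, "state": "NY", "point": 64},
    {"name": "Bob", "age": 42, "state": "CA", "point": 92},
    {"name": "Charlie", "age": 18, "state": "CA", "point": 70},
])

# Same statements as in the "Bad" example, but the drop now comes first.
df = df.drop(columns="state")

error = None
try:
    # This line silently relied on "state" still being present.
    df["surrogate_key"] = df["name"] + "-" + df["age"].astype(str) + "-" + df["state"]
except KeyError as exc:
    error = exc

print(repr(error))
```

Nothing in the statement-by-statement style signals that the two lines are coupled, so the problem only shows up at runtime.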
Better: pandas functional API
result = (
    df
    .assign(point_ratio=lambda d: d['point'] / 100)
    .assign(surrogate_key=lambda d: d.apply(lambda r: f"{r['name']}-{r['age']}-{r['state']}", axis=1))
    .drop(columns='state')
    .sort_values('age')
    .head(3)
)
By using .assign and wrapping the chain in parentheses (), we anchor our approach in functional programming.
Each transformation is on its own line, and df is mentioned only once, at the start of the chain.
The order of the transformations is now explicit.
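One property worth spelling out is that .assign and .drop return new dataframes instead of modifying the one they are called on. The sketch below, which reuses the example dataframe with a shortened chain, checks that the input is left untouched. The columns_before name is only there for the comparison.

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {"name": "Alice", "age": 24, "state": "NY", "point": 64},
    {"name": "Bob", "age": 42, "state": "CA", "point": 92},
    {"name": "Charlie", "age": 18, "state": "CA", "point": 70},
])
columns_before = list(df.columns)

# Each method call returns a new dataframe; df itself is never reassigned.
result = (
    df
    .assign(point_ratio=lambda d: d["point"] / 100)
    .drop(columns="state")
)

print(list(df.columns))      # the input still has its original columns
print(list(result.columns))  # the output has point_ratio and no state
```

Because the input is not mutated, the chain can be re-run on the same df and give the same result, which makes interactive debugging easier.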
On the other hand, the surrogate_key transformation is hard to write:
- there are two nested lambda functions
- we iterate on rows using .apply and axis=1, which adds complexity
- we are using unspoken rules like naming d the parameter of type pd.DataFrame, and naming r the parameter which is a "Row" of the dataframe.
Because code is read more often than it is written, investing the time to write this code is still worth it for teams. But we can do better.
Best: use pandas.DataFrame.itertuples with the functional API
result = (
    df
    .assign(point_ratio=lambda d: d['point'] / 100)
    .assign(surrogate_key=lambda d: [f"{user.name}-{user.age}-{user.state}" for user in d.itertuples(name="User")])
    .drop(columns='state')
    .sort_values('age')
    .head(3)
)
We take the same approach as before, but we tweak the surrogate_key transformation.
This time:
- no nested lambda
- we iterate over rows using itertuples, which preserves the dtypes of the rows and gives us namedtuple objects
- an explicit variable name, user, instead of r previously
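The dtype point is easiest to see on a frame that mixes integer and float columns. The sketch below uses a small numeric-only frame, and the scores and Score names are made up for illustration. As far as I know, row-wise .apply builds one Series per row, which can only hold a single dtype, while itertuples keeps each field's own type.

```python
import pandas as pd

# Illustrative data: one integer column and one float column.
scores = pd.DataFrame({"age": [24, 42], "point_ratio": [0.64, 0.92]})

# With axis=1, each row is a Series with a single dtype, so the integer
# age is upcast to float to sit alongside point_ratio.
via_apply = scores.apply(lambda r: f"{r['age']}", axis=1).tolist()

# itertuples yields one namedtuple per row; age stays an integer.
via_itertuples = [f"{row.age}" for row in scores.itertuples(name="Score")]

print(via_apply)
print(via_itertuples)
```

In a surrogate key this difference matters: the .apply version would embed 24.0 in the key, while the itertuples version embeds 24.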
Conclusion
In this short article, I have shown you a way to write your pandas data pipelines that makes Data Engineering code more explicit and easier to maintain.