
# My ML Engineer stack
Here I will try to structure my personal stack for ML Engineering, and update it over time.
## How to use this blog post
- This helps me structure the code of my projects (e.g. each h2 could be a Python module).
- This can also serve me as a checklist for reviewing a past/someone else’s project.
## Part 1: Data Engineering
- Creating the Data Repository
    - Use an `S3` data lake or a combined `Redshift` DWH + S3 lake. Do the S3 buckets used have lifecycle configurations?
    - Think about the price vs. fast-loading-time tradeoff and consider using `FSx for Lustre`, `EFS`, or `EBS`.
- Writing the Data Ingestion
    - Consider staying outside of `Apache Spark` when possible:
        - ingest from DWH to S3: use `dbt` + the `UNLOAD` command of Redshift
        - ingest from S3 to S3 using Python FaaS (e.g. `pandas` or `dask`)
        - orchestrate those batch transformations with `Apache Airflow`
        - default back to Spark when its value-add trumps the complexity overhead (e.g. `pyspark.sql.functions.explode`)
        - use Parquet as the data lake's default format (see my other post)
    - If batch ingesting (and transforming) in Spark, use `Glue` over `EMR`.
    - For streaming, default to `Kinesis` and think of these 3 use cases:
        - use `Kinesis Firehose` to ingest into S3, and then transform in batches
        - use `Kinesis Data Analytics` to run ML models on the stream (e.g. RandomCutForest on clickstream data for fraud detection)
        - use the `Kinesis Client Library` to read the data off the stream in other services like EMR
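The S3-to-S3 FaaS pattern above can be sketched as follows. The column names (`user_id`, `ts`), the event shape, and the split between a pure transform and an I/O handler are my own illustrative assumptions, not from any specific project:

```python
# Sketch of an S3-to-S3 batch transform as a Python FaaS handler.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Pure transform step: drop rows with no user_id, derive a weekend flag."""
    out = df.dropna(subset=["user_id"]).copy()
    out["is_weekend"] = pd.to_datetime(out["ts"]).dt.dayofweek >= 5
    return out

def handler(event, context=None):
    """FaaS entry point: read Parquet from S3, transform, write back to S3."""
    df = pd.read_parquet(event["source_key"])  # s3://... paths work via s3fs
    transform(df).to_parquet(event["dest_key"], index=False)
```

Keeping the transform pure makes it unit-testable without touching S3.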
- Writing the Data Transformation
    - I leverage SQL as much as possible using `dbt`.
    - I default to Python FaaS using `pandas` and other Python data-stack libraries if SQL is impractical or unnecessarily complicated.
    - I default to Spark in Glue if all of those fail.
## Part 2: Exploratory Data Analysis
- Sanitise and prepare data for modelling
    - dataset-level stats, feature-level stats, outlier removal, imputation
    - use `pandas_profiling` :)
    - for manual labelling, use `doccano` (it’s free) or `Sagemaker Ground Truth`. doccano doesn’t have active learning baked in, as opposed to Ground Truth or `prodigy`.
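For the outlier-removal step, a minimal pandas sketch; the 1.5×IQR fence is a common rule of thumb, and the toy column below is made up for illustration:

```python
# Drop rows where a column falls outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

df = pd.DataFrame({"x": [1, 2, 2, 3, 2, 100]})
clean = remove_outliers_iqr(df, "x")  # the 100 is dropped
```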
- Feature engineering
    - dimensionality reduction techniques (PCA, t-SNE, UMAP)
    - scaling on numerical features and one-hot encoding on categorical features
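scikit-learn's `ColumnTransformer` ties the scaling / one-hot-encoding split together, and can feed straight into a dimensionality reduction step; the toy columns below are invented for illustration:

```python
# Scale numerical columns, one-hot encode categorical ones, then project with PCA.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [23, 35, 62, 44],
    "income": [30_000, 52_000, 81_000, 47_000],
    "country": ["DE", "FR", "DE", "ES"],
})

pre = ColumnTransformer(
    [("num", StandardScaler(), ["age", "income"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    sparse_threshold=0,  # force dense output so PCA accepts it
)
pipe = Pipeline([("pre", pre), ("pca", PCA(n_components=2))])
X2 = pipe.fit_transform(df)  # shape (4, 2)
```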
- Analyse and Visualise data
    - combine the dimensionality reduction techniques with scatter plots, using hue to colour points by category
    - use confusion matrices a lot
## Part 3: Modeling
- Model framing
    - think of the business metric you’re trying to improve
    - think of the model at deployment time:
        - what features are available then
        - how users and backend systems interact with predictions
- Model search
    - the more models in your toolbelt the better
    - but at some point you have to stop trying new things
    - track experiments in `MLflow`
- Optimiser choice (= what to do when SGD fails):
    - Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates, which can help the model converge faster and escape local minima.
    - Adagrad is a gradient-based optimisation algorithm that adapts the learning rate to each parameter, performing smaller updates on frequently updated parameters, which in turn helps with convergence.
    - RMSProp uses a moving average of squared gradients to normalise the gradient itself, which helps with faster convergence.
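For intuition, here is a single Adam update step written out in NumPy (the textbook formulas, not a framework API): the first-moment estimate is the momentum part, the second-moment estimate is the RMSProp-like part.

```python
# One Adam update step; default hyperparameters from the original paper.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment (RMSProp-like) estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```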
- Hyperparameter optimisation
    - Grid search is a good baseline, but random search is better.
    - Bayesian optimisation is better than random search.
    - Grid search and random search are in `scikit-learn`.
    - All 3 are in `Sagemaker` hyperparameter tuning jobs (`sagemaker.tuner.HyperparameterTuner`).
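A minimal scikit-learn sketch of grid vs. random search on a toy dataset; the estimator and parameter ranges are illustrative only:

```python
# Grid search over a fixed grid vs. random search over a distribution.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=3).fit(X, y)
rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          {"C": loguniform(1e-3, 1e2)},
                          n_iter=8, cv=3, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)
```

Random search samples `C` from a continuous log-uniform distribution instead of a handful of fixed points, which is why it tends to cover the space better for the same budget.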
- Evaluation
    - use `sklearn` (e.g. `sklearn.metrics`)
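For example, `sklearn.metrics` gives you the confusion matrix mentioned in the EDA section directly (toy labels below):

```python
# Confusion matrix: rows = true class, columns = predicted class.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)  # [[2, 1], [1, 2]]
f1 = f1_score(y_true, y_pred)
```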
## Part 4: MLOps
- Build ML solutions for performance, availability, scalability, resiliency, and fault tolerance
    - think of using `SQS` queues for loosely coupling systems in a fault-tolerant way
    - Sagemaker endpoints do a lot of the heavy lifting for you already
    - deploy using this stack: `Sagemaker endpoint` + `Lambda` + `API Gateway`
    - enable historical Sagemaker API logs using `CloudTrail`
- When to use what
    - an off-the-shelf high-level AI service (e.g. `Rekognition`)
    - vs. when to tune a SageMaker default algo
    - vs. when to bring your own model (e.g. PyTorch or TensorFlow) using Docker + Sagemaker
- Deploy and operationalize the ML models
    - How to update a Sagemaker endpoint without downtime: de-register the endpoint as a scalable target, update the endpoint using a new endpoint configuration pointing at the latest model’s Amazon S3 path, and finally register the endpoint as a scalable target again.
    - Use Sagemaker endpoints’ production variants to split traffic between multiple models. You can use this for canary deploys by iterating on the weight ratio.
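The canary weight shift can be sketched with `boto3`'s `update_endpoint_weights_and_capacities` call; the endpoint and variant names below are hypothetical:

```python
# Shift traffic between production variants on a live Sagemaker endpoint.
def weight_payload(weights):
    """Map {variant_name: weight} to the DesiredWeightsAndCapacities shape."""
    return [{"VariantName": name, "DesiredWeight": float(w)}
            for name, w in weights.items()]

def shift_traffic(endpoint_name, weights):
    import boto3  # imported here so the pure helper above needs no AWS deps
    sm = boto3.client("sagemaker")
    sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=weight_payload(weights),
    )

# e.g. send 10% of traffic to a canary variant:
# shift_traffic("my-endpoint", {"current": 0.9, "canary": 0.1})
```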