- This article is intentionally high-level and free of technical jargon.
- That said, at times I will mention a few technologies/services to give you more concrete examples. For these examples, I will be referring to AWS services, but you should be able to find similar offerings from other cloud providers.
In 2017-18, I designed and built an end-to-end ML pipeline, with the different components discussed below, for Amazon Advertising: the team responsible for running all personalized display advertising on behalf of Amazon worldwide, at millions of transactions per second (TPS) and under 100 ms latency.
In 2018-19, while working on Amazon Alexa, we also had to manage the deployment of our models, but being a newly formed team, we faced different trade-offs than Amazon Advertising did.
During the past few years, I have also consulted for startups and individuals on their ML engineering problems.
Why Invest in an ML Pipeline
For these reasons, you need a reliable way of managing your models. Managing here means not only having a reproducible way of deploying them to production, but also keeping an eye on them through monitoring and alarming, as well as the other aspects discussed below.
What is an ML Pipeline
- Dataset generation: responsible for fetching, joining and aggregating data from the different sources, cleaning the data, applying different filters (ex: subsampling) and generating clean training and validation datasets
- Training: responsible for training the ML model, performing hyperparameter optimization and generating a trained model
- Validation: responsible for running the trained model on the validation dataset and generating validation metrics
- Deployment: responsible for copying the trained model to the production service (if the model passes certain validation criteria and outperforms the current model in production)
- Monitoring & Alarming: there are at least two sets of metrics and alarms, depending on the role:
- Engineering: responsible for metrics such as model latency, memory usage, CPU usage, etc.
- Science: responsible for metrics such as model bias, input or output distribution drift, etc.
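To make the science-side monitoring a little more concrete, here is a minimal sketch of one commonly used drift metric, the Population Stability Index (PSI), computed over model scores. The sample data, the bucket count and the ~0.2 alarm threshold are illustrative assumptions, not prescriptions.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of model scores.

    Assumes scores lie in [0, 1]. As a rough rule of thumb, a PSI
    above ~0.2 suggests a significant distribution shift worth alarming on.
    """
    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int(v * bins), bins - 1)] += 1
        # Smooth empty buckets so the log term stays defined.
        return [(c + 1e-6) / len(values) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Illustrative data: a uniform "training" score distribution and a
# "production" distribution shifted upwards (simulated drift).
train_scores = [i / 100 for i in range(100)]
prod_scores = [min(1.0, s + 0.4) for s in train_scores]
```

A monitoring job could periodically compute this over recent production scores and trigger an alarm when the value crosses the chosen threshold.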
For that reason, dataset generation, training, validation and deployment could just be a series of Python scripts that you run manually following a runbook (checked into your Git repo or posted on your internal wiki). Depending on your use case, monitoring & alarming may be a bonus at this point. Whichever path you choose, make sure that there is a reproducible and documented way of getting models into production. This is crucial and will come in handy when things go awry.
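As an illustration of the "series of Python scripts" approach, here is a toy sketch of a pipeline-as-a-script. Every stage function, the stand-in "model" (a simple mean) and the validation threshold are hypothetical placeholders, not a real implementation; in practice each stage might be its own script invoked per the runbook.

```python
def generate_dataset():
    # Fetch, join, filter and split the raw data (stubbed here).
    data = list(range(20))
    return data[:16], data[16:]  # train split, validation split

def train(train_set):
    # Stand-in "model": just the mean of the training values.
    return sum(train_set) / len(train_set)

def validate(model, val_set):
    # Stand-in metric: mean absolute error against the validation set.
    return sum(abs(x - model) for x in val_set) / len(val_set)

def deploy(model, metric, threshold=20.0):
    # Only ship the model if it clears the validation bar.
    if metric < threshold:
        return {"status": "deployed", "model": model}
    return {"status": "rejected", "model": None}

# The runbook: run the stages in order, by hand, every time.
train_set, val_set = generate_dataset()
model = train(train_set)
metric = validate(model, val_set)
result = deploy(model, metric)
```

The point is not the (deliberately trivial) modeling, but that the whole path from raw data to a deployed model is captured in one reproducible, documented sequence.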
Note that if you are already familiar with off-the-shelf services like SageMaker and can move fast using them, then by all means do. The important thing at this stage is to optimize for moving fast and experimenting.
Short to Medium Term
Note that, at this stage, you still value moving as fast as you possibly can, so you want to spend as little time as possible on engineering and maintaining ML pipelines. For example, you would rather spend that engineering time building more valuable features for your customers.
For these reasons (improving robustness while keeping the investment low), you should consider using off-the-shelf services like SageMaker (which supports all core components of ML pipelines mentioned above). Such services will allow you to have robust ML pipelines with auto-scaling, ease of rollback, etc. without investing too much engineering time. Note that there is still effort to be invested in setting things up, but for the most part, once that cost is paid, it should be a "hands off the wheel" experience in terms of deployment and scaling.
However, SageMaker will cost more than running your own infrastructure on, for example, EC2. The trade-off then is cost on one hand versus robustness and speed of experimentation on the other. That cost can be justifiable at this stage, since you are saving precious engineering time.
Long-Term Investment (Custom Build)
Concretely, you will borrow the features that you care about from the off-the-shelf services and add more advanced custom ones depending on your use case. For example, at this stage, the pipeline may look as follows:
- Dataset generation is automated and runs on a specified schedule (ex: daily, or dynamically when the model's input or output distribution changes in production): a cluster (ex: Spark) is spun up and runs a script to gather, transform and output the data to a durable store (ex: S3)
- Once a new dataset is generated, it triggers the training process. The output of the training step may be not only the trained model, but also other meta information (ex: hyperparameters used, dataset version, model version, etc.)
- Once the model is trained, a validation step is triggered, and it compares the newly trained model's performance against that of the current model in production (ex: comparing their performance on a gold-standard dataset, but beware of dataset drift)
- If the model passes validation, it gets deployed to production
- The monitoring service then kicks in, keeps an eye on the important metrics, triggers alarms when an anomaly is detected, and may support rolling a model back automatically
- You also have the ability to manually intervene and roll back your models
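The deploy-if-better gate and the rollback steps above can be sketched as simple bookkeeping. The `ModelRegistry` class below, and its metric convention (higher is better), are illustrative assumptions rather than a real system.

```python
class ModelRegistry:
    """Toy sketch of deploy/rollback bookkeeping: keeps the deployment
    history so an operator (or an automated alarm) can roll back."""

    def __init__(self):
        self.history = []  # most recent deployment last

    @property
    def live(self):
        # The model currently serving production traffic, if any.
        return self.history[-1] if self.history else None

    def deploy_if_better(self, candidate, candidate_metric, live_metric):
        # Gate: only promote the candidate if it beats the production
        # model on the shared validation set (higher is better here).
        if self.live is None or candidate_metric > live_metric:
            self.history.append(candidate)
            return True
        return False

    def rollback(self):
        # Manual or alarm-triggered: revert to the previous model.
        if len(self.history) > 1:
            self.history.pop()
        return self.live
```

In a real pipeline the "models" would be artifacts in durable storage (ex: S3) and the history would live in a database, but the control flow, gate on deploy, keep history, pop on rollback, is the same.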
I hope this gave you a good overview of what an ML pipeline is, why you should build one, and a few points to keep in mind depending on the stage your company or team is at.
If you speak Arabic, I have created two short courses to get you started in Data Science, as well as a short course with career tips (the course I wish had existed when I was in my last few years of university or just graduating), all free to access.