Truly embrace volatility and save 70% on Deep Learning training costs using Spot Instances

Thursday, 4 Apr, 2019   /   Zag


Deep Learning (DL) is a subset of the Machine Learning (ML) research field which has arisen from the need to push the boundaries of what is possible and accurately learn how to solve even more complex problems. Much like its predecessor, DL is also concerned with learning algorithms and data representations inspired by the structure and function of the human brain. However, instead of expecting all features to be provided manually prior to model training, DL endeavours to automatically uncover these learning data representations. Sounds exciting, right?

Having been at the centre of numerous ML conversations in the past few years, leaders and experts in the DL field have come to the following consensus:

  1. There is a window of opportunity to develop extraordinary new AI-based technologies with DL that have significant commercial and societal value.  
  2. DL is not a product that you can get off-the-shelf and if you genuinely want to do something that has never been done before, it will take significant time (and money) to develop it.
  3. DL is rather costly given how demanding it can be in terms of the hardware needed. (No wonder there is a race between major cloud players to develop faster Graphics Processing Unit (GPU), Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) products!)

Although points one and two are widely known, little has been done so far to meaningfully address the cost factor raised in point three. This is where my research comes in.

The problem

Environment volatility has long been misunderstood by the technical community, as the fear of “losing” prime capacity in virtual computing environments is almost too much to bear. However, there are many benefits of embracing volatility, particularly with Amazon’s Elastic Compute Cloud (Amazon EC2), when it comes to DL experimentation.

Considering that Data Scientists don't usually nail DL models on their first try, the training cost will likely be multiplied by the number of attempts they have to make. Honestly speaking, it isn't rare to see this reflected in bills of several thousand dollars at the end of the month, which basically leaves us with two immediate options when it comes to Amazon Web Services (AWS).

  1. To reserve an instance: this provides a generous discount compared to On-demand pricing and, under certain criteria, a capacity reservation. The problem with this approach is that it comes with a one- or three-year commitment, which in most cases we’re not willing to make.
  2. To use Spot Instances (spare EC2 capacity in the AWS cloud): this comes at an even steeper discount relative to On-demand pricing, but with one catch: the application running on them must be interruptible and resumable, because AWS can interrupt a Spot Instance with two minutes' notice when it needs the capacity back. Handling this requires more infrastructure knowledge than a typical Data Scientist has, which is why I suspect this option hasn't been tackled yet, despite its importance.
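As a rough illustration of the two-minute notice, the sketch below reads the Spot interruption notice from the EC2 instance metadata service and works out how much time is left. The `spot/instance-action` metadata path is the one AWS documents; the function names are hypothetical, and the sketch assumes the metadata service can be reached without a session token (instances that enforce IMDSv2 need an extra token request, omitted here).

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Documented metadata path; it answers HTTP 404 until an interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def read_notice(url: str = SPOT_ACTION_URL) -> Optional[str]:
    """Return the raw notice body, or None while no interruption is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=1) as response:
            return response.read().decode("utf-8")
    except OSError:  # covers HTTP 404, timeouts and an unreachable endpoint
        return None


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Parse a notice such as {"action": "terminate", "time": "...Z"}.

    Returns the seconds left before the action, or None if the body
    is not a valid notice.
    """
    try:
        notice = json.loads(body)
        when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    except (ValueError, KeyError, TypeError):
        return None
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()
```

A training wrapper could poll `read_notice()` every few seconds and trigger a final snapshot upload as soon as it returns a body.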

The good news is that most DL frameworks, such as Caffe and TensorFlow, support resuming their training process from a "checkpoint". However, these features usually assume that you always have access to the instance where training is running, and developing a native integration with Spot Instances (and similar models) is out of their scope.
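To make the checkpoint idea concrete for Caffe, here is a minimal sketch of how a resume step might pick the most recent snapshot. It assumes Caffe's default snapshot naming, `<prefix>_iter_<N>.solverstate`; the function name and directory layout are hypothetical and not taken from the script discussed in this post.

```python
import re
from pathlib import Path
from typing import Optional

# Caffe's default snapshot names end in "_iter_<N>.solverstate".
_ITERATION = re.compile(r"_iter_(\d+)\.solverstate$")


def latest_solverstate(snapshot_dir: str) -> Optional[Path]:
    """Return the .solverstate file with the highest iteration, if any."""
    best, best_iteration = None, -1
    for path in Path(snapshot_dir).glob("*.solverstate"):
        match = _ITERATION.search(path.name)
        # Compare iterations numerically: "10000" sorts before "500" as text.
        if match and int(match.group(1)) > best_iteration:
            best, best_iteration = path, int(match.group(1))
    return best
```

The returned path is what would be handed to Caffe's `--snapshot` flag so that training continues from that iteration instead of starting over.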

To know more about the Spot Instances model, please refer to:

The Solution

While working in the Zag Research & Development Team, I’ve developed a solution that makes Caffe training resilient to Spot Instance volatility while still obtaining 70% savings. Yes, you read that right.

The solution comprises two main components, as depicted in the architecture diagram below.

  1. An EC2 Spot Instance with a shell script, injected via the UserData field at launch time, that is responsible for resuming Caffe training. This instance is placed in an Auto Scaling group, which replaces it whenever it’s taken down, as long as the bid is still above the threshold.
  2. An S3 bucket to which the Caffemodel and Solverstate files are synced on a regular basis. This ensures two essential behaviours:
    • Caffemodel/Solverstate snapshot files containing the much-desired weights are uploaded to a highly available storage service, where they are replicated several times for high durability.
    • If for some reason a Spot Instance is taken from you, the next instance launched will download the last synced files, which lets the script on the new instance resume training.
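The upload and download behaviours above can be sketched with the AWS CLI's `aws s3 sync` command. This is only an illustration under assumptions: the bucket name, the `snapshots` prefix and the helper names are hypothetical, and the actual script may sync differently.

```python
import subprocess
from typing import List


def sync_up_command(local_dir: str, bucket: str, prefix: str = "snapshots") -> List[str]:
    """Build the command that uploads only Caffe snapshot files to S3."""
    return [
        "aws", "s3", "sync", local_dir, f"s3://{bucket}/{prefix}",
        # Exclude everything first, then re-include the two snapshot types.
        "--exclude", "*",
        "--include", "*.caffemodel",
        "--include", "*.solverstate",
    ]


def sync_down_command(local_dir: str, bucket: str, prefix: str = "snapshots") -> List[str]:
    """Build the command a freshly launched instance runs before resuming."""
    return ["aws", "s3", "sync", f"s3://{bucket}/{prefix}", local_dir]


def run(command: List[str]) -> None:
    """Run a command and raise if it fails (requires the AWS CLI on PATH)."""
    subprocess.run(command, check=True)
```

Calling `run(sync_up_command(...))` from a timer while training is in progress would give the "synced on a regular basis" behaviour; `sync_down_command` covers the replacement instance.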

The key to this solution lies in the script injected into the instance at launch via the user data field. You can see its source code here: 


In a nutshell, this script takes care of syncing models to an S3 bucket as well as building the appropriate command to be executed by bash. It also checks for the GPUs attached to the instance so that it can specify them on the Caffe train command.
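The original script is written in shell; as a language-neutral illustration of the same idea, the Python sketch below detects GPUs with `nvidia-smi` and assembles a `caffe train` command. The `--solver`, `--snapshot` and `--gpu` flags are Caffe's own command-line options; the helper names are hypothetical and the real script may build its command differently.

```python
import subprocess
from typing import List, Optional


def parse_gpu_indices(nvidia_smi_output: str) -> List[str]:
    """Turn one-index-per-line nvidia-smi output into a list of indices."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def detect_gpus() -> List[str]:
    """Ask nvidia-smi for GPU indices; return [] if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_indices(result.stdout)


def build_train_command(solver: str, gpus: List[str],
                        snapshot: Optional[str] = None) -> List[str]:
    """Assemble the caffe train command, resuming if a snapshot is given."""
    command = ["caffe", "train", f"--solver={solver}"]
    if snapshot:
        command.append(f"--snapshot={snapshot}")
    if gpus:
        command.append("--gpu=" + ",".join(gpus))
    return command
```

For example, `build_train_command("solver_train.prototxt", detect_gpus())` yields a fresh training run, while passing a `.solverstate` path as `snapshot` yields a resumed one.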

It’s worth mentioning that some of these arguments can be overridden by settings in ‘solver*.prototxt’, which would stop the script from working the way you expect, so make sure this isn’t happening to you.
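One way to check a solver file before launching is sketched below. The post does not list which settings are affected, so this assumes the relevant ones are the snapshot-related fields of Caffe's solver definition (`snapshot`, the snapshot interval, and `snapshot_prefix`); the helper names are hypothetical and the parser only handles simple `key: value` lines.

```python
from typing import Dict, List


def read_solver_fields(text: str) -> Dict[str, str]:
    """Collect top-level `key: value` pairs from a solver prototxt."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip().strip('"')
    return fields


def snapshot_warnings(fields: Dict[str, str]) -> List[str]:
    """Flag settings that would leave nothing to sync or resume from."""
    warnings = []
    interval = fields.get("snapshot", "0")
    if not interval.isdigit() or int(interval) == 0:
        warnings.append("snapshot interval is unset or 0: no periodic snapshots")
    if "snapshot_prefix" not in fields:
        warnings.append("snapshot_prefix is unset: snapshot location is unclear")
    return warnings
```

Printing `read_solver_fields(...)` for your own solver file is a quick way to see which values Caffe will actually use.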

How can I use it? 

I put together a CloudFormation template that automates the process of creating all required AWS resources and injecting the script on the user data field with a click of a button. You can find it here:

For the rest of this blog post I will assume that you know what CloudFormation is and why you should be using it. To know more about CloudFormation, please refer to:

Without further ado, let’s get our hands dirty.

Step 1 – Go to the CloudFormation Console and click the ‘Create a Stack’ button 


Step 2 – Specify the input parameters according to your use case and click on the ‘Next’ button


Before moving on, let’s dive deep into what those parameters are and the assumptions behind them. Although CloudFormation sorts the parameters alphabetically, it is easier to reason about them in the following order:




EC2 image ID which contains your dataset ready for training and Caffe exported to PATH. 


Absolute path of the root directory from where Caffe train command must be initialised. 


Relative path from the CaffeTrainingDir parameter above where the solver_train.prototxt is located. 


The instance type that you want to launch.

Pro-tip: Since you’re using Spot, you may as well steer clear of the most coveted instance types and choose one that has only the hardware you need. This matters because, in most cases, the Spot market price for such an instance will stay within your bid for a substantial amount of time.




Maximum price that you are willing to pay for the instance chosen. 


Size of the EBS volume attached to the instance. 


An EC2 KeyPair to be associated with the instance, if you ever want to SSH into it.


CIDR block from which you want to allow SSH access to this instance. 


Step 3 – Navigate through the wizard, mark the IAM creation permission checkbox and click on the ‘Create’ button 


Step 4 – Wait for all resources to reach the ‘CREATE_COMPLETE’ state 


That’s it! Wasn’t it easy? Now I’ll take you to the place where you can find the resources produced by this solution. 

Weight and solver state files will be located in the S3 bucket created by the CloudFormation template, as shown below:


To see Caffe training logs, go to the CloudWatch Logs console and look for a “LogGroup” called train.log, as shown below:


Finally, in case you want to see how much you’ve saved by using this solution, just go to the EC2 Console -> Spot Requests -> Savings Summary 
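For a back-of-the-envelope estimate before you have any Spot usage to look at, the arithmetic behind a savings figure is simply the On-demand cost minus the Spot cost over the same training hours. The sketch below uses made-up hourly prices purely for illustration; real prices depend on instance type, region and time.

```python
from typing import Dict


def spot_savings(on_demand_hourly: float, spot_hourly: float,
                 hours: float) -> Dict[str, float]:
    """Compare the cost of the same training hours on On-demand and Spot."""
    on_demand_cost = on_demand_hourly * hours
    spot_cost = spot_hourly * hours
    saved = on_demand_cost - spot_cost
    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "saved": saved,
        "saved_pct": 100.0 * saved / on_demand_cost,
    }


# Hypothetical prices: 1.00 USD/hour On-demand vs 0.30 USD/hour Spot,
# over 100 hours of training spread across several attempts.
summary = spot_savings(on_demand_hourly=1.00, spot_hourly=0.30, hours=100)
print(f"Saved {summary['saved']:.2f} USD ({summary['saved_pct']:.0f}%)")
```

Because training cost scales with the number of attempts, `hours` should be the total across all runs, not just the successful one.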



We went through the process of automating Caffe training to tackle the restrictive cost factor of DL and we were able to save 70% of the original cost. Undoubtedly, this can help other companies and enthusiasts make DL more powerful and affordable for everyone.

Zag are proud of our innovation focus, both from the perspective of what we have delivered for our customers and of our contribution to open source communities. We want as many businesses and communities to benefit from our innovation as possible, therefore we encourage other developers to contribute to our code and make it even better.

If you would like Zag to make any of what is mentioned above a reality for your organisation, please do not hesitate to get in touch with me or via our website. Check us out, and if you are interested, please don’t be afraid to enquire about our solutions or about joining our awesome team.

Written by Paulo Almeida, R&D Architect for Zag.
