How to choose the right AWS GPU-capable EC2 instance and launch your first machine learning project on it
The benefits of machine learning (ML) are becoming more accessible as a growing array of tools shortens the on-ramp for the eager masses ready to leverage it for their own projects. The challenge, as is frequently the case with new technology, is where to start and how to do so effectively, without hitting dead ends that create unnecessary frustration.
This guide is for those who want to increase the velocity of their work by moving their training to the AWS cloud with NVIDIA GPU-capable EC2 instances. Readers will be taken step by step through the factors that determine the best instance choices for ML (including time, cost, and overall workload), and will ultimately get started training a model using PyTorch.
Whether you are just getting started or a skilled practitioner, this guide will provide the information you need to get started with an ML model on AWS. All you need is an AWS account (which you can get here).
Choosing the right EC2 instance for Machine Learning
When assessing the appropriate resources for a new machine learning project, there are choices to be made in order to produce the desired outcomes within an acceptable time frame. This is especially true when choosing among AWS's NVIDIA GPU-capable EC2 instances. This tutorial focuses on three instance families for training ML models:
- Amazon EC2 G4 Instances - The G4 instances are the most cost-effective instance for small scale training and inferencing. Great for early proof-of-concept and situations where time sensitivity is not a limiting factor.
- Amazon EC2 P3 Instances - Accelerate your machine learning with high performance computing in the cloud using the P3 Instances. Use these instances to speed up your training and iteration time so that you can do more with your ML models.
- Amazon EC2 P3dn Instances - Explore larger and more complex machine learning algorithms with twice the power of the P3 Instances. Choose this instance when you are ready for fast turn-around on your model training, or when you have needs for distributed ML training.
There is a lot to understand when selecting the correct instance, but ultimately the decision can be boiled down to the same few factors that virtually all computing decisions revolve around:
1. Scale & overall workload
2. Performance / time to results
Below we will examine the details of each instance type in order to understand the factors for choosing the right NVIDIA GPU capable instances, and following that provide a step-by-step guide on getting started with machine learning training for each instance type.
Before we get started, however, the first thing to understand is the size of the model and data set you will be using for training. For standard ML training, the entire model needs to fit into the memory of a single GPU, with some room left over for overhead such as activations and gradients. Knowing your sizes up front equips you with the information you need to determine the correct instance for your needs.
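As a rough sizing check, you can estimate a training footprint from the parameter count: weights, gradients, and optimizer state (Adam, for example, keeps two extra tensors per parameter) each consume memory before activations are counted. A minimal sketch of that arithmetic; the function name and the bytes-per-parameter rule of thumb are illustrative assumptions, not an AWS tool:

```python
def estimate_training_memory_gb(num_params, bytes_per_param=4, optimizer_copies=2):
    """Rough lower bound on GPU memory needed to train a model.

    Counts weights + gradients + optimizer state (e.g. Adam keeps two
    extra FP32 tensors per parameter). Activations add more on top,
    so treat this strictly as a floor, not a budget.
    """
    tensors_per_param = 1 + 1 + optimizer_copies  # weights, grads, optimizer
    return num_params * bytes_per_param * tensors_per_param / 1024**3

# ResNet-50 has roughly 25.6M parameters
print(round(estimate_training_memory_gb(25_600_000), 2))
```

By this estimate, ResNet-50's weights and optimizer state need well under 1 GB; in practice, activations and batch size dominate, so leave generous headroom against the GPU's capacity.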
Choosing Amazon EC2 G4 Instances
Simply put, the Amazon EC2 G4 is the best instance for cost-efficient deep learning training and inferencing. G4 instances provide access to NVIDIA T4 GPUs and can be launched as single-GPU instances or as multi-GPU instances (with 4 or 8 GPUs).
The T4 GPU's primary use cases are inferencing and graphics, but it also works great for development, testing, prototyping, and small-model training.
The G4 instance should be your first consideration for a training instance as you begin building and prototyping the model that you need to train. The G4 instances are great for simple validations, provided that the data set is small enough to fit on the 16 GB NVIDIA T4 Tensor Core GPUs that make up these instances. This is a great entry-level point for validating a model.
To understand the difference in power between the G4 and P3 instances, consider that the G4 instances use T4 GPUs, which have a power cap of 70 W, much lower than the V100s available in the Amazon EC2 P3 instances (which have a power cap of 300 W). This tells you quite a bit about the relative performance of these instances: if performance and time to results are your priority, you should consider what the P3 instances have to offer.
However, for use cases such as prototyping and testing, performance may not be your priority. Instead, you might be looking for a solid architecture that can efficiently deliver the power you need to simply get started training and testing your models, or fleshing out a proof of concept. In cases like these, the G4 instances are your best bet for cost-effective training.
The starting point for the G4 family is the g4dn.xlarge, which is equipped with one T4 GPU, 4 vCPUs, and 16 GB of system memory. It also comes with up to 125 GB of storage and up to 25 Gbps of network bandwidth (including up to 3.5 Gbps of EBS bandwidth).
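Once a g4dn instance is up (for example, with a Deep Learning AMI that ships with PyTorch), a quick sanity check confirms that PyTorch can see the T4 and its memory. A minimal sketch; the helper name is our own:

```python
import torch

def describe_device():
    """Report the accelerator PyTorch will use; falls back to CPU."""
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        return f"cuda: {name} ({mem_gb:.0f} GB)"
    return "cpu"

print(describe_device())
```

On a g4dn.xlarge this should report a single Tesla T4 with roughly 16 GB; on a machine without a GPU it falls back to "cpu".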
There are seven different instance sizes in the G4 family, detailed in the following table:
What types of machine learning models are a good fit for the G4?
Many ML models simply will not run on a G4 because they are too large to fit on its GPUs. Model types that are typically well suited for training on the G4 include:
- Image classification
- Object detection
- Recommendation engines
- Automated speech recognition
- Language Translation
The bottom line: if you are getting started and need an instance that provides basic, cost-conscious access to machine learning training, the G4 family will get you what you need to get moving.
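To make that "get moving" step concrete, here is a minimal PyTorch training loop of the kind a single T4 handles easily. It uses a tiny linear classifier and a synthetic batch as stand-ins (a real run would use, say, a torchvision model and a DataLoader); the `.to(device)` pattern is what moves the work onto the GPU:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in for an image classifier (e.g. torchvision's resnet18).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(64, 3, 32, 32, device=device)   # synthetic batch
labels = torch.randint(0, 10, (64,), device=device)  # synthetic targets

first_loss = None
for step in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()
last_loss = loss.item()

print(f"loss: {first_loss:.3f} -> {last_loss:.3f}")
```

The same loop runs unchanged on the CPU, a single T4, or a V100; only the device (and the wall-clock time) differs.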
Choosing Amazon EC2 P3 Instances
While the G4 instances are great for cost-efficient training to help get your project established, the Amazon EC2 P3 instances are great for taking things to the next level and speeding up your time to results. The P3 instance is generally considered the best instance for high-performance deep learning training, giving users access to powerful NVIDIA V100 GPUs, roughly 4x more powerful than the T4 GPUs available on G4 instances. The NVIDIA V100 Tensor Core GPU comes in 16 GB and 32 GB configurations, giving you an extraordinary amount of memory to work with for your ML models.
The V100 GPU is the world's most powerful accelerator for deep learning. It won MLPerf, the first industry-wide AI benchmark, validating it as a powerful, scalable, and versatile computing platform for deep learning and machine learning workloads.
When selecting the Amazon EC2 P3 instance you have the option to select a single V100 GPU, or upgrade to either 4 or 8 V100 GPUs, putting an incredible amount of computing power at your fingertips for training machine learning models. This is ideal for data scientists, researchers and engineers looking to get to their next breakthrough more quickly.
Ultimately, the P3 instance is the right choice when performance is a priority and you are dealing with larger amounts of data that you need crunched more quickly. As with the G4 instances, you can do prototyping and testing; however, because of the amount of power available with the P3 instances, your iteration time will be much faster.
The starting point for the P3 family is the p3.2xlarge, which is equipped with one Tesla V100 GPU (16 GB of GPU memory), 8 vCPUs, and 61 GB of system memory. It comes with up to 10 Gbps of network bandwidth (including up to 1.5 Gbps of EBS bandwidth).
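The V100's Tensor Cores are exercised through mixed-precision training, which PyTorch exposes via `torch.autocast` and a gradient scaler. A hedged single-step sketch; on a machine without CUDA it silently falls back to full precision:

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
# GradScaler guards against FP16 gradient underflow; no-op on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
# autocast runs eligible ops in FP16, which the V100's Tensor Cores accelerate.
with torch.autocast(device_type=device.type, enabled=use_cuda):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"{loss.item():.3f}")
```

On a p3.2xlarge, running the matrix multiplies in FP16 on the Tensor Cores can substantially increase throughput versus FP32; the exact speedup depends on the model.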
There are four different instance sizes in the P3 family, detailed in the following table:
What types of machine learning models are a good fit for the P3?
Virtually any ML model is a good fit for the P3 instance; however, for certain workloads the data is so large that the P3 should be the immediate consideration as a starting point (bypassing the G4). These include:
- Financial Modeling
- Autonomous Vehicle Platforms
- Neural Networks
- Deep learning
- Real time processing
The bottom line: Amazon EC2 P3 instances are more expensive than G4 instances, but for that price you get a lot more performance, cutting down your time to results. P3 instances are the right choice if you are ready to get more performance out of your ML models, need to iterate more quickly, and prioritize time to results.
Choosing Amazon EC2 P3dn Instances
By now, you should recognize that the size of your machine learning model, coupled with the performance you need, are the main factors when selecting an AWS instance for ML training.
For most ML training, the P3 is a great place to start for training and validating your models. However, once your models are validated and you are ready to move them into production, where speed is a requirement, you may want to consider upgrading to AWS's P3dn instance, which enables you to scale up for a much faster time to results.
The P3dn uses eight NVIDIA V100 Tensor Core GPUs (each with 32 GB of GPU memory), connected with NVIDIA NVLink® for faster GPU-to-GPU communication. With the P3dn, you can train across GPUs with a total of 256 GB of memory, double what you get with the largest P3 instance.
The main reason to jump to a P3dn instance is to speed up your results. It is also worth noting that if a team working on a machine learning project needs a larger cluster, that too is a reason to select the P3dn.
Bottom line: Amazon EC2 P3 instances are a great place to start if you are looking to validate your models. The P3dn is typically used by organizations that have validated models ready for production and for whom speeding up results is a requirement. It is also a good instance choice for larger teams that need to use the resources simultaneously.
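Distributed training across the P3dn's eight GPUs (or several P3dn nodes) is typically done with PyTorch's DistributedDataParallel (DDP), which all-reduces gradients between ranks at each step. The sketch below runs a single CPU process with the gloo backend just to show the moving parts; on a real P3dn you would launch one process per GPU (e.g. with `torchrun`) using the NCCL backend, which takes advantage of NVLink:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process illustration; a real launch sets these via torchrun,
# one process per GPU, with the proper rank and world size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # use "nccl" on GPUs

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(16, 4).to(device)
ddp_model = DDP(model)  # gradients are all-reduced across ranks on backward()

out = ddp_model(torch.randn(8, 16, device=device))
print(tuple(out.shape))

dist.destroy_process_group()
```

A multi-GPU launch would look something like `torchrun --nproc_per_node=8 train.py`, with rank and world size supplied by the launcher; the training loop itself is unchanged from single-GPU code.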
Below are walkthroughs to help you set up model training on the instance types discussed above:
- Training a ResNet-50 ImageNet Model using PyTorch on a Single AWS G4 or P3 Instance
- Training a ResNet-50 ImageNet Model using PyTorch on multiple AWS G4 or P3 Instances
- Training a BERT Fine Tuning Model using PyTorch on a single AWS P3 Instance
- Training a BERT Fine Tuning Model using PyTorch on multiple AWS P3 Instances
- Object Detection Training using Mask R-CNN on AWS P3dn Instances