AWS – Why We Need To Make Machine Learning Happen Faster

moimesgentpefo
Aug 17, 2023

Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. Amazon EC2 P3 instances have been proven to reduce machine learning training times from days to minutes, as well as increase the number of simulations completed for high performance computing by 3-4x.


With up to 4x the network bandwidth of P3.16xlarge instances, Amazon EC2 P3dn.24xlarge instances are the latest addition to the P3 family, optimized for distributed machine learning and HPC applications. These instances provide up to 100 Gbps of networking throughput, 96 custom Intel Xeon Scalable (Skylake) vCPUs, 8 NVIDIA V100 Tensor Core GPUs with 32 GiB of memory each, and 1.8 TB of local NVMe-based SSD storage. P3dn.24xlarge instances also support Elastic Fabric Adapter (EFA), which accelerates distributed machine learning applications that use the NVIDIA Collective Communications Library (NCCL). EFA can scale to thousands of GPUs, significantly improving the throughput and scalability of deep learning model training, which leads to faster results.

For data scientists, researchers, and developers who need to speed up ML applications, Amazon EC2 P3 instances are the fastest in the cloud for ML training. Amazon EC2 P3 instances feature up to eight latest-generation NVIDIA V100 Tensor Core GPUs and deliver up to one petaflop of mixed-precision performance to significantly accelerate ML workloads. Faster model training can enable data scientists and machine learning engineers to iterate faster, train more models, and increase accuracy.

Combining one of the most powerful GPU instances in the cloud with flexible pricing plans results in an exceptionally cost-effective solution for machine learning training. As with Amazon EC2 instances in general, P3 instances are available as On-Demand Instances, Reserved Instances, or Spot Instances. Spot Instances take advantage of unused EC2 instance capacity and can lower your Amazon EC2 costs by up to 70% compared with On-Demand prices.
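
To make the Spot discount concrete, here is a minimal sketch of the arithmetic. The hourly rate below is a made-up placeholder rather than a real AWS price, and 70% is the best-case discount quoted above; actual Spot prices vary with available capacity.

```python
def estimated_cost(on_demand_hourly_usd: float, hours: float, discount: float = 0.0) -> float:
    """Estimate the cost of a run at a given discount off the On-Demand rate."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return on_demand_hourly_usd * hours * (1.0 - discount)


# Hypothetical On-Demand rate, for illustration only (not a real AWS price).
EXAMPLE_RATE_USD = 10.0

on_demand_total = estimated_cost(EXAMPLE_RATE_USD, hours=24)
# 0.70 is the "up to" figure from the text, so this is a best case.
best_case_spot_total = estimated_cost(EXAMPLE_RATE_USD, hours=24, discount=0.70)

print(f"On-Demand: ${on_demand_total:.2f}, best-case Spot: ${best_case_spot_total:.2f}")
```

Because Spot capacity can be reclaimed, an estimate like this is most useful for training jobs that checkpoint regularly and can resume after an interruption.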

Use pre-packaged Docker images to deploy deep learning environments in minutes. The images are fully tested and contain the required deep learning framework libraries (currently TensorFlow and Apache MXNet) and tools. You can easily add your own libraries and tools on top of these images for a higher degree of control over monitoring, compliance, and data processing. In addition, Amazon EC2 P3 instances work seamlessly with Amazon SageMaker to provide a powerful and intuitive complete machine learning platform. Amazon SageMaker is a fully managed machine learning platform that enables you to quickly and easily build, train, and deploy machine learning models. Furthermore, Amazon EC2 P3 instances can be launched from AWS Deep Learning Amazon Machine Images (AMIs) that come pre-installed with popular deep learning frameworks. This makes it faster and easier to get started with machine learning training and inference.

You can use multiple Amazon EC2 P3 instances with up to 100 Gbps of networking throughput to rapidly train machine learning models. Higher networking throughput enables developers to remove data transfer bottlenecks and efficiently scale out their model training jobs across multiple P3 instances. Customers have been able to train ResNet-50, a common image classification model, to industry-standard accuracy in just 18 minutes using 16 P3 instances. This level of performance was previously unattainable by the vast majority of ML customers as it required a large CapEx investment to build out on-premises GPU clusters. With P3 instances and their availability via an On-Demand usage model, this level of performance is now accessible to all developers and machine learning engineers. In addition, P3dn.24xlarge instances support Elastic Fabric Adapter (EFA), which accelerates distributed training jobs that use the NVIDIA Collective Communications Library (NCCL) and can scale to thousands of GPUs.

Amazon EC2 P3 instances support all major machine learning frameworks including TensorFlow, PyTorch, Apache MXNet, Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), Chainer, Theano, Keras, Gluon, and Torch. You have the flexibility to choose the framework that works best for your application.

Airbnb is using machine learning to optimize search recommendations and improve dynamic pricing guidance for hosts, both of which translate to increased booking conversions. With Amazon EC2 P3 instances, Airbnb can run training workloads faster, go through more iterations, build better machine learning models and reduce costs.

NerdWallet is a personal finance startup that provides tools and advice that make it easy for customers to pay off debt, choose the best financial products and services, and tackle major life goals like buying a house or saving for retirement. The company relies heavily on data science and machine learning (ML) to connect customers with personalized financial products.

Pinterest uses mixed precision training on P3 instances on AWS to speed up training of deep learning models, and also uses these instances for faster inference with these models, to enable a fast and unique discovery experience for users. Pinterest uses PinSage, built with PyTorch on AWS. This AI model groups images together based on certain themes. With 3 billion images on the platform, there are 18 billion different associations that connect images. These associations help Pinterest contextualize themes and styles and produce more personalized user experiences.

Salesforce is using machine learning to power Einstein Vision, enabling developers to harness the power of image recognition for use cases such as visual search, brand detection, and product identification. Amazon EC2 P3 instances enable developers to train deep learning models much faster so that they can achieve their machine learning goals quickly.

Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models. When used together with Amazon EC2 P3 instances, customers can easily scale to tens, hundreds, or thousands of GPUs to train a model quickly at any scale without worrying about setting up clusters and data pipelines. You can also easily access Amazon Virtual Private Cloud (Amazon VPC) resources for training and hosting workflows in Amazon SageMaker. With this feature, you can use Amazon Simple Storage Service (Amazon S3) buckets that are only accessible through your VPC to store training data, and to store and host the model artifacts derived from the training process. In addition to S3, models can access all other AWS resources contained within the VPC.

Amazon SageMaker makes it easy to build machine learning models and get them ready for training. It provides everything that you need to quickly connect to your training data, and to select and optimize the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3. You can also use the notebook instance to write code to create model training jobs, deploy models to Amazon SageMaker hosting, and test or validate your models.

You can begin training your model with a single click in the console or with an API call. Amazon SageMaker is pre-configured with the latest versions of TensorFlow and Apache MXNet, and with CUDA 9 library support for optimal performance with NVIDIA GPUs. In addition, hyperparameter optimization can automatically tune your model by intelligently adjusting different combinations of model parameters to quickly arrive at the most accurate predictions. For larger scale needs, you can scale to tens of instances to support faster model building.
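
As a rough idea of what "an API call" can look like, the sketch below builds a request for the SageMaker `CreateTrainingJob` operation and, optionally, submits it with boto3. Every value passed in (job name, image URI, role ARN, bucket paths) is a hypothetical placeholder, and `ml.p3.16xlarge` is assumed as the training instance type; the helper function names are invented for this example.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, instance_type="ml.p3.16xlarge",
                               instance_count=1, max_runtime_seconds=24 * 60 * 60):
    """Build the keyword arguments for SageMaker's CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container holding the training code
            "TrainingInputMode": "File",     # copy the data to local disk first
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 100,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def submit_training_job(request):
    """Send the request to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed
    return boto3.client("sagemaker").create_training_job(**request)


# All identifiers below are hypothetical placeholders, not real resources.
request = build_training_job_request(
    job_name="example-p3-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
print(request["ResourceConfig"])
```

Keeping the request builder separate from the submit step makes it easy to inspect or adjust the configuration (for example, raising `instance_count`) before any resources are started.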

For developers with more customized requirements, the AWS Deep Learning AMIs are an alternative to Amazon SageMaker. They provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. You can quickly launch Amazon EC2 P3 instances pre-installed with popular deep learning frameworks such as TensorFlow, PyTorch, Apache MXNet, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch, Chainer, Gluon, and Keras to train sophisticated, custom AI models, experiment with new algorithms, or learn new skills and techniques.

Amazon EC2 P3dn.24xlarge instances are the fastest, most powerful, and largest P3 instance size available and provide up to 100 Gbps of networking throughput, 8 NVIDIA V100 Tensor Core GPUs with 32 GiB of memory each, 96 custom Intel Xeon Scalable (Skylake) vCPUs, and 1.8 TB of local NVMe-based SSD storage. The faster networking, new processors, doubling of GPU memory, and additional vCPUs enable developers to significantly lower the time to train their ML models or run more HPC simulations by scaling out their jobs across several instances (e.g., 16, 32, or 64 instances). Machine learning models require a large amount of data for training. In addition to increasing the throughput of passing data between instances, the additional network throughput of P3dn.24xlarge instances can also be used to speed up access to large amounts of training data by connecting to Amazon S3 or shared file system solutions such as Amazon EFS.
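
To get a feel for what scaling out across 16, 32, or 64 instances means, the following sketch multiplies out the per-instance figures quoted above (8 GPUs with 32 GiB each). The class and function names are invented for this example, and the totals are simple arithmetic, not measured performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Per-instance figures taken from the P3dn.24xlarge description above."""
    gpus: int
    gpu_memory_gib: int  # memory per GPU


P3DN_24XLARGE = InstanceSpec(gpus=8, gpu_memory_gib=32)


def cluster_totals(spec: InstanceSpec, instances: int) -> dict:
    """Total GPU count and total GPU memory for a cluster of identical instances."""
    total_gpus = spec.gpus * instances
    return {"gpus": total_gpus, "gpu_memory_gib": total_gpus * spec.gpu_memory_gib}


for count in (16, 32, 64):
    totals = cluster_totals(P3DN_24XLARGE, count)
    print(f"{count} instances: {totals['gpus']} GPUs, "
          f"{totals['gpu_memory_gib']} GiB of GPU memory")
```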

P3dn.24xlarge instances offer NVIDIA V100 Tensor Core GPUs with 32 GiB of memory that deliver the flexibility to train more advanced and larger machine learning models as well as process larger batches of data such as 4K images for image classification and object detection systems.

One of the many advantages of cloud computing is the elastic nature of provisioning or deprovisioning resources as you need them. By billing usage down to the second, we enable customers to increase their elasticity, save money, and optimize the allocation of resources toward achieving their machine learning goals.
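
As a rough illustration of why per-second billing matters for short jobs, the sketch below compares it with billing rounded up to whole hours. The hourly rate is a made-up placeholder, and the 60-second minimum charge is an assumption of this example; check current EC2 pricing terms for the exact rules.

```python
import math


def per_second_cost(hourly_rate_usd: float, seconds: int, minimum_seconds: int = 60) -> float:
    """Cost when usage is billed by the second (with an assumed minimum charge)."""
    billed_seconds = max(seconds, minimum_seconds)
    return hourly_rate_usd * billed_seconds / 3600


def hourly_rounded_cost(hourly_rate_usd: float, seconds: int) -> float:
    """Cost when every started hour is billed in full."""
    return hourly_rate_usd * math.ceil(seconds / 3600)


# Hypothetical rate; an 18-minute run on a single instance.
RATE_USD = 10.0
run_seconds = 18 * 60

by_second = per_second_cost(RATE_USD, run_seconds)
by_hour = hourly_rounded_cost(RATE_USD, run_seconds)
print(f"per-second: ${by_second:.2f}, hourly-rounded: ${by_hour:.2f}")
```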
