Genomics Analytics on Google Cloud Platform

Running genomics analytics on Google Cloud Platform (GCP) lets researchers take advantage of Genome Analysis Toolkit (GATK) optimizations and the Intel® AVX acceleration supported by Intel® Xeon® Scalable processors to speed time to insight. Increasingly, the scientific community is looking to harness cloud-based computing for genomics analytics. This is especially true for projects with unpredictable compute utilization and for data sets that are already located in the public cloud.

GATK provides industry-standard tools for identifying single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) in germline DNA and RNA-seq data. This guide provides high-level recommendations for using GCP and GATK, specifically for a Germline Variant Calling pipeline, also known as secondary analysis. Choosing the right cloud infrastructure is important when performing whole genome variant calling in the cloud. Consider the dataset size, the pipeline you are running, and the characteristics of the workload.

Intel and the Broad Institute collaborated to create the Genomics Kernel Library (GKL), a collection of Intel-optimized libraries used throughout the genomics workflow. GKL includes compression and decompression libraries, as well as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) implementations of common genomics tools. GKL is distributed as open source with GATK and enables faster runtimes and more samples processed per day. Performance also benefits from fast local storage, including Intel® 3D NAND NVMe SSDs.

This guide provides information on the GCP instance types that use Intel AVX-512: N1, N2, and C2. It also describes how to automate the Best Practice Pipeline workflow on GCP and provides links to more details about running genomics analytics in the cloud.
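
Because the GKL speedups described above depend on Intel AVX-512, it can be useful to confirm that a running instance actually exposes those instructions before launching a pipeline. The following is a minimal sketch, not part of GATK or GKL, and it assumes a Linux guest where /proc/cpuinfo is readable. The function name avx512_flags is hypothetical and chosen only for illustration; avx512f is the flag the Linux kernel reports for the AVX-512 Foundation instructions.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text: str) -> set[str]:
    """Return the AVX-512 feature flags found in /proc/cpuinfo-style text.

    Hypothetical helper for illustration; not part of GATK or GKL.
    """
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        # Each relevant line looks like "flags : fpu vme ... avx512f ..."
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return flags


def main() -> None:
    cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux guest
    if not cpuinfo.exists():
        print("No /proc/cpuinfo found; this check only works on Linux.")
        return
    found = avx512_flags(cpuinfo.read_text())
    if "avx512f" in found:
        print("AVX-512 available:", ", ".join(sorted(found)))
    else:
        print("AVX-512 Foundation flag (avx512f) not reported by this CPU.")


if __name__ == "__main__":
    main()
```

Run on the instance itself, the script prints the AVX-512 flags the kernel reports, or a note that none were found. Keeping the parsing in a function that takes text makes it easy to check against saved cpuinfo output from different instance types.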