Nearly five years ago, Stephens and colleagues (PLoS Biology, 2015) extrapolated trends in genomic data generation ten years forward, predicting an exciting and terrifying future of exabyte-scale genomic datasets in the early 2020s. The genomics community had experienced roughly 3x year-over-year data growth with no sign of slowing, and analyzing the largest datasets with existing tools was becoming infeasible. Geneticists processing these datasets were spending less time doing science and more time implementing and running even basic analyses.
The Hail project began in 2015 with the goal of building open-source, scalable tools that let geneticists interrogate the largest genomic datasets. Hail has since evolved from a command-line VCF-processing tool into a versatile Python library for data manipulation and analysis, with a wide array of functionality specifically for genomics.
Modern data science in nearly every domain is powered by dataframe libraries. Pandas, R’s Tidyverse, SAS, SQL engines, and many other tools are designed to process data consisting of observations (rows) of heterogeneous fields (columns). However, these tools cannot easily and efficiently represent genomic data.
Analysis-ready genomic datasets have rows (sites in the genome that vary across individuals), columns (sequenced individual identifiers), and a third dimension, the several fields comprising the probabilistic measurement of an individual’s genotype at a specific site. Hail represents this type of dataset in a first-class way with a core piece of the Hail library, the MatrixTable.
Tip: see the names and types of fields at any point with describe().
Hail functionality is built around the MatrixTable interface. This is the core unifying interface for import, export, exploration, and statistical analysis.
Many input formats, one interface
In genomics, there are many file input formats and data representations. Hail unifies the various genomic data formats by importing them into the same data structures, Hail Tables and MatrixTables. Examples of Hail-compatible input formats include variant call format (.vcf), tab-separated values (.tsv), Oxford genotype (.bgen, .gen), PLINK, gene transfer format (.gtf), browser extensible data format (.bed), as well as other textual representations.
1) Filter, group, and aggregate
Most exploratory data analysis is built on standard dataframe operations: filtering, grouping, and aggregation.
2) Annotation. Hail makes it easy to annotate a dataset in a single line with resources such as the variant effect predictor (VEP) or the genome aggregation database (gnomAD).
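Under the hood, annotation is a keyed join. In this self-contained sketch, a toy sites table stands in for a real resource like gnomAD (which you would instead load from disk); the one-line pattern at the end is the real Hail idiom:

```python
import hail as hl

hl.init()

mt = hl.balding_nichols_model(n_populations=2, n_samples=10, n_variants=20)

# A toy sites table standing in for a real annotation resource such as
# gnomAD; it shares the MatrixTable's (locus, alleles) row key.
sites = mt.rows().select(label='annotated')

# The one-line annotation pattern: join the table into the row fields.
mt = mt.annotate_rows(anno=sites[mt.row_key])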
3) Visualization. Hail includes a plotting library built on Bokeh that makes it easy to visualize fields of Hail Tables and MatrixTables.
4) Statistical analysis toolkit
Hail features a suite of functionality for standard statistical genetics models.
Hail can be used to interrogate biobank-scale genomic data.
Scalability. It’s not feasible to process tens or hundreds of thousands of whole genomes on a single computer. Hail has a backend that automatically and efficiently scales to leverage a compute cluster, such as one deployed on demand in the cloud, so that users can focus on the contents of their pipelines rather than on how to parallelize them. The interactive analysis experience is the same whether a Jupyter notebook is running on a laptop or backed by 5,000 CPUs on a cloud cluster.
By using more computers, Hail can run large computations in a short amount of wall time, which is how long users wait for results. Scalability does not come for free, though: even though wall time decreases on large clusters, total CPU time (wall time multiplied by the number of compute resources in use, and a proxy for cost) increases slightly. A perfectly scalable system could run on a huge number of computers without inflating CPU time, making all queries interactive at no additional cost.
Perfect scalability is not feasible, but Hail comes acceptably close: a roughly 300x speedup in wall time at roughly 2x the CPU cost.
Efficiency is the total amount of CPU time a tool uses, which translates directly into what an analysis costs: a more efficient tool makes better use of computer hardware, takes less total CPU time, and therefore costs less.
The Hail team is invested in improving both scalability and efficiency. We believe that this is necessary to keep pace with growing genomic datasets.
How do I get started with Hail?
Hail can be run on multiple computational platforms:
Laptop. It’s easy to run Hail on a single machine, like a laptop, for small data. This is a convenient way to build and test analyses before running them on a full dataset in the cloud.
Cloud. Hail installations come bundled with a tool (hailctl dataproc) that can be used to manage Hail clusters running on Google Dataproc on Google Cloud Platform (GCP). Hail can also run on Google Cloud with Terra, as well as on Amazon Web Services (AWS) and Microsoft Azure using community-built tooling.
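A typical cloud workflow with hailctl dataproc looks like the following sketch; the cluster name, worker count, and script name are placeholders, and running it requires a configured GCP project with billing enabled:

```shell
# Start a Hail cluster on Google Dataproc (name and size are illustrative).
hailctl dataproc start my-cluster --num-workers 4

# Submit a Python script that uses Hail to the cluster.
hailctl dataproc submit my-cluster my_analysis.py

# Tear the cluster down when finished, to stop incurring charges.
hailctl dataproc stop my-cluster
```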
Head to the Installation page to get started.
More functions and how-to guides are available in the Hail documentation.
What has Hail produced (so far)?
Hail has seen robust adoption in academia and industry by enabling those without expertise in parallel computing to flexibly, efficiently, and interactively analyze large genomic datasets. In the past few years, Hail has been the analytical engine behind dozens of studies, including large-scale projects such as the Genome Aggregation Database (gnomAD) and the Neale lab’s UK Biobank rapid GWAS, and, most recently, it was the driving force behind the Pan-ancestry genetic analysis of the UK Biobank (Pan-UKBB).
To date, more than 50 publications have cited Hail in published research, and the Hail Python package has been downloaded from the Python Package Index more than 140,000 times.