The COVID-19 pandemic has exacted a major public health and death toll. In response to this global health crisis, a group of geneticists formed the largest GWAS initiative to date: the COVID-19 Host Genetics Initiative (HGI). The HGI's main objective is to connect hundreds of geneticists from around the globe to better understand the variability observed among patients affected by COVID-19, specifically in susceptibility to and severity of the disease.
Given the sheer scale at which genomic data continues to grow, Hail has been one of the main driving forces behind the analysis of the sequencing data currently being generated.
In this post, we present recommendations outlined for HGI analysts working with sequencing data. These recommendations were put together by Kumar Veerapen (Hail Support and Community Engagement Manager) and Konrad Karczewski (Faculty, Analytic and Translational Genetics Unit).
We provide a three-part guideline for the analysis of the data:
PART 1: VARIANT CALLING (will be published by team GATK)
PART 2: SAMPLE and VARIANT QUALITY CONTROL (QC)
PART 3: ASSOCIATION ANALYSIS (future post)
In this blog post, we focus on PART 2: SAMPLE and VARIANT QC. We highlight Hail's ability to handle sequencing datasets efficiently for the most important step in any genomic analysis pipeline: QC.
SAMPLE and VARIANT QC
Sample scripts for this section can be found in this GitHub repository: https://github.com/mkveerapen/covid19_sequencing
1.0 Sample Quality Control
All VCF files can be imported as Hail MatrixTables using Hail's
import_vcf function. We highly recommend this input format and the Hail platform for conducting analyses because of their ease of use and portability.
1.0.1 WES Interval QC
This step is optional. For sample QC purposes, we suggest filtering to intervals where 85% of samples have a mean coverage of at least 20X, especially when disparate sequencing platforms are used. Intervals that do not pass this interval QC can then be flagged for exclusion downstream.
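The interval QC rule above can be sketched in plain Python (in practice this would be computed over a Hail MatrixTable of per-interval coverage; the function and argument names here are illustrative, not part of Hail's API):

```python
def flag_intervals(coverage, pct_samples=0.85, min_mean_dp=20.0):
    """Conceptual sketch of interval QC.

    coverage: dict mapping interval name -> list of per-sample mean depths.
    Returns the set of intervals that FAIL QC, i.e. where fewer than
    pct_samples of samples reach min_mean_dp mean coverage."""
    failed = set()
    for interval, depths in coverage.items():
        n_pass = sum(1 for d in depths if d >= min_mean_dp)
        if n_pass / len(depths) < pct_samples:
            failed.add(interval)
    return failed
```

Flagged intervals can then be excluded from the sites used for sample QC metrics.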
1.0.2 Sex Imputation
We suggest inferring sex using the Hail function
impute_sex, run on common, biallelic SNPs (allele frequency > 0.05) with a high call rate (call rate > 0.97). We also recommend plotting the data to confirm that it falls within reasonable limits of these thresholds.
To refine inferred sex, we suggest using each sample's fraction of chromosome Y coverage, normalized by chromosome 20 coverage. Sex-chromosome aneuploidies can be flagged in samples that impute as female but have a normalized chrY fraction > 0.1, as well as in samples that impute as male but have a normalized chrY fraction < 0.1. Samples whose imputed sex is missing are marked as ambiguous.
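The refinement logic above can be sketched as a small decision function. The chrY fraction threshold (0.1) comes from the text; the F-statistic cutoff used to call the initial female/male split is an illustrative assumption, since the exact impute_sex thresholds are study-specific:

```python
def assign_sex(f_stat, norm_chry_frac):
    """Refine imputed sex using the F-statistic from impute_sex and the
    chrY coverage fraction normalized by chr20 coverage.

    f_stat: inbreeding coefficient on chrX (None if missing).
    norm_chry_frac: chrY coverage / chr20 coverage."""
    if f_stat is None:
        return "ambiguous"
    imputed = "female" if f_stat < 0.5 else "male"  # illustrative cutoff
    # Flag potential sex-chromosome aneuploidies per the thresholds above:
    if imputed == "female" and norm_chry_frac > 0.1:
        return "aneuploidy"
    if imputed == "male" and norm_chry_frac < 0.1:
        return "aneuploidy"
    return imputed
```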
1.0.3 Additional Filters
We recommend removing samples with any of the following:
- Mean coverage < 20X
- Ambiguous sex
- Call rate < 97% (0.97)
1.0.4 Relatedness filters
Samples can be filtered to remove one of each pair of related samples using Hail's
maximal_independent_set, applied to kinship estimates from PC-Relate (a model-free relatedness estimator). We suggest filtering at second-degree relatedness or closer, removing one of each pair of samples with a kinship coefficient > 0.088.
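A conceptual, pure-Python sketch of what this pruning does (Hail's maximal_independent_set solves the same problem at scale over a table of PC-Relate kinship estimates; this greedy version is for illustration only):

```python
def prune_related(samples, kin_pairs, kin_cutoff=0.088):
    """Greedy relatedness pruning sketch.

    samples: iterable of sample IDs.
    kin_pairs: iterable of (sample_a, sample_b, kinship) tuples.
    Returns the set of samples to keep, with no remaining pair
    exceeding kin_cutoff."""
    related = [(a, b) for a, b, k in kin_pairs if k > kin_cutoff]
    keep = set(samples)
    while True:
        # Count how many remaining related pairs each sample is in.
        counts = {}
        for a, b in related:
            if a in keep and b in keep:
                counts[a] = counts.get(a, 0) + 1
                counts[b] = counts.get(b, 0) + 1
        if not counts:
            return keep
        # Drop the most-connected sample and repeat.
        keep.discard(max(counts, key=counts.get))
```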
1.0.5 Population Ancestry Inference
To increase the accuracy of population ancestry inference, we recommend selecting an approach based on study ascertainment:
- If the ascertained population is relatively homogeneous:
We recommend projecting the exome data onto the gnomAD population principal components and then using a random forest classifier trained on gnomAD ancestry labels to assign ancestry to the exome samples (loadings can be found in this bucket: gs://gnomad-public/release/2.1/pca/gnomad.r2.1.pca_loadings.ht). Alternatively, the RF classifier can be run on PCs from 1000 Genomes or HGDP. If you are on GRCh38, a newly released public genotype callset combining 1000 Genomes and HGDP is available (https://gnomad.broadinstitute.org/downloads#v3-hgdp-1kg).
- If the ascertained population is relatively heterogeneous (multi-ancestry):
We recommend a hybrid approach that first performs the PCA projection described above. Second, to account for highly admixed samples, we recommend performing a from-scratch PCA on the exome dataset and clustering the resulting PCs with an unsupervised method, e.g. HDBSCAN. Under this hybrid method, any sample assigned to a cluster by the from-scratch PCA receives that cluster as its ancestry assignment, preserving substructure that a projection PCA alone would miss; any sample not assigned to a cluster receives the label from the PCA projection and random forest classification.
Methods other than those mentioned above are welcome if the user has good reason to choose them.
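The label-merging step of the hybrid approach can be sketched as follows (the function and label names are illustrative; HDBSCAN's convention of marking unassigned samples with -1 is assumed):

```python
def hybrid_ancestry(cluster_labels, rf_labels):
    """Hybrid ancestry assignment sketch.

    cluster_labels: sample -> HDBSCAN cluster ID from the from-scratch
        PCA (-1 or missing means no cluster was assigned).
    rf_labels: sample -> label from the projection PCA + random forest.
    Prefer the cluster label when one exists; otherwise fall back to
    the random forest label."""
    assigned = {}
    for sample, rf_label in rf_labels.items():
        cluster = cluster_labels.get(sample)
        if cluster is not None and cluster != -1:
            assigned[sample] = f"cluster_{cluster}"
        else:
            assigned[sample] = rf_label
    return assigned
```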
1.0.6 Outlier Detection
Using the Hail
sample_qc method, we suggest removing outliers that deviate from the median by more than a set number of median absolute deviations (MAD; the non-parametric analogues of the mean and standard deviation) for the following metrics. Note that this outlier detection should be stratified by the population ancestry (and, where applicable, sequencing platform) determined in subsection 1.0.5:
n_snp: Number of SNP alternate alleles
r_ti_tv: Transition/transversion ratio
r_insertion_deletion: Insertion/Deletion allele ratio
n_insertion: Number of insertion alternate alleles
n_deletion: Number of deletion alternate alleles
r_het_hom_var: Heterozygous/homozygous call ratio
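A minimal sketch of MAD-based outlier detection for any one of these metrics within a single ancestry stratum (the four-MAD cutoff is an illustrative choice, not a prescription):

```python
import statistics

def mad_outliers(values, n_mads=4.0):
    """Return the indices of values deviating from the median by more
    than n_mads median absolute deviations. Apply per metric, within
    each ancestry (and platform) stratum."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return {i for i, v in enumerate(values)
            if mad > 0 and abs(v - med) > n_mads * mad}
```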
1.1 Variant Quality Control
Upon completion of the sample QC described in section 1.0, exomes should then be processed for variant QC, which will depend on the chosen analysis method, e.g. SAIGE or REGENIE. We recommend applying a PASS filter using the Variant Quality Score Recalibration (VQSR) metric.
1.2 Genotype Quality Control
High-quality genotypes can be retained by applying the following thresholds. We also recommend performing call rate filtering separately for cases and controls, since differential missingness is a typical source of false positives:
AB >= 0.25 (for each allele in heterozygous calls)
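The allele-balance rule can be sketched on a single genotype entry (a conceptual stand-in for the equivalent Hail entry-filtering expression; the function name and argument layout are assumptions):

```python
def passes_ab_filter(gt, ad, min_ab=0.25):
    """Allele-balance filter sketch for one genotype call.

    gt: tuple of the two called allele indices, e.g. (0, 1).
    ad: list of per-allele read depths for this call.
    For heterozygous calls, require each called allele to carry at
    least min_ab of the total read depth."""
    a, b = gt
    if a == b:            # homozygous calls are not AB-filtered here
        return True
    total = sum(ad)
    if total == 0:
        return False
    return ad[a] / total >= min_ab and ad[b] / total >= min_ab
```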
1.3 Functional Annotation
Variants can be annotated using the Variant Effect Predictor (VEP) as implemented in Hail (
annotation_db), with the default parameters for GRCh38 (including LOFTEE). In addition, for downstream gene-based tests, we recommend grouping variants into genes by Ensembl canonical transcript ID and/or HGNC symbol with the following annotations:
- pLoF: High-confidence LoF variants (LOFTEE), including stop-gained, essential splice, and frameshift variants, filtered according to a set of first principles as described on the GitHub repo and gnomAD
- Missense | Low-confidence (LC): Missense variants are grouped with in-frame insertions and deletions, as well as low-confidence LoF variants (filtered out by LOFTEE). The latter have a frequency spectrum consistent with missense variation, and affect a set of amino acids in a similar fashion (e.g. a frameshift in the final exon).
- synonymous: All synonymous variants in the gene (control set).
- Additional VEP or machine-learning annotations are also available, e.g. 'splice_region_variant' or models from the kipoi repository (ref: Julien Gagneur)
Now that your data has been QC-ed and functionally annotated, you are ready to start analyzing your exomes/genomes. If you need sample scripts, head over to this GitHub repository: https://github.com/mkveerapen/covid19_sequencing
Currently, the analysis of the exomes and genomes for the HGI is underway!
The bulk of the analyses is being led by Brent Richards and Guillaume Butler-Laporte (McGill University, Canada); and Alessandra Renieri and Francesca Mari (University of Siena, Italy).
We would love to hear from you, Hail users. What do you think of the outline? Are there practices that you prefer over what we have suggested? And why? Leave your thoughts in our comments section.