Slicing and dicing aligned genomic and transcriptomic reads for genetic epidemiology


Author(s): Peter Yizhou Huang, Lauren Marie Harmon, Xiaotu Ma, Tim Triche

Affiliation(s): Van Andel Institute



Rare diseases and conditions present special difficulties for genetic epidemiology. For example, all childhood cancers are rare diseases, as are almost all cancer predisposition syndromes, the majority of primary immunodeficiency conditions, and most chromosomal birth defects. In aggregate, however, rare diseases are not rare; approximately 20% of medical consultations for an identifiable syndromic condition will eventually resolve to a rare genetic condition. Tools that overcome these difficulties are therefore not only welcome but essential. In recent years, the NCI Genomic Data Commons has begun to offer BAM slicing as an option for authorized users of controlled-access data. Particularly for diseases where samples are scarce or logistical hurdles have delayed whole-genome sequencing (WGS) of subjects, this opens an alternative path to validating rare expressed genetic variants: transcriptomic epidemiology, using the `GenomicDataCommons` Bioconductor package to automate downloads, variant calling, and analysis. We demonstrate pilot results from this strategy for the TARGET Pediatric AML project, where we identify rare but recurrent somatic variants in approximately 2% of cases with normal cytogenetics, _prima facie_ low risk stratification, and dismal outcomes. We further illustrate how this strategy expands the scope for validating expressed germline risk variants by an order of magnitude beyond existing gold-standard WGS results, the latter funded by the Gabriella Miller Kids First! (GMKF) project. We note that GMKF and the INCLUDE consortium, which aims to study the full spectrum of consequences and quality-of-life determinants in people affected by Down syndrome, make similar tools available and have indicated that support for BAM slicing may be added in the near future. This approach is a practical application of Bioconductor tools to rare disease epidemiology, where we find such tools underutilized at present.
This presentation will offer practical notes from implementing the strategy in thousands of subjects, few of whom are covered by WGS, in contrast to projects like TCGA where most subjects have both transcriptome and genome sequencing results. We will illustrate how partial WGS and full whole-transcriptome sequencing (WTS) together illuminate affected transcript isoforms and refine tissue-specific consequences of variants, as well as the selective pressure acting on wild-type, germline-variant, and somatic-variant cells over time and treatment. We conclude with an overview of variant calling pipelines available for downstream analysis of sliced BAMs, along with notes on implementation for germline and somatic calls. This presentation may be useful for both clinical and research genetic epidemiology, as well as for users of NIH and NCI cloud projects and comparable resources.
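
As a rough illustration of the BAM slicing step mentioned above, the following is a minimal sketch of how a slicing request to the GDC REST API might be assembled. It is written in Python rather than with the `GenomicDataCommons` R package and is not the authors' pipeline. The helper name `build_slice_request`, the placeholder file UUID, the coordinates, and the gene symbol are all hypothetical and chosen only for illustration. The function builds the URL and headers without any network access.

```python
from urllib.parse import urlencode

# GDC BAM slicing endpoint; controlled-access files also need a user token.
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_request(file_uuid, regions=(), genes=(), token=None):
    """Return (url, headers) for a GDC BAM slicing request (hypothetical helper).

    regions: strings such as "chr1:1000-2000"
    genes:   gene symbols, passed to the API as `gencode` parameters
    token:   GDC authentication token, sent in the X-Auth-Token header
    """
    if not regions and not genes:
        raise ValueError("specify at least one region or gene symbol")
    # Repeated keys are allowed: one `region` or `gencode` entry per slice.
    params = [("region", r) for r in regions] + [("gencode", g) for g in genes]
    # Keep ':' unescaped so regions stay readable in the query string.
    query = urlencode(params, safe=":")
    url = f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"
    headers = {"X-Auth-Token": token} if token else {}
    return url, headers


# Placeholder values for illustration only; not real identifiers.
url, headers = build_slice_request(
    "00000000-0000-0000-0000-000000000000",
    regions=["chr1:1000-2000"],
    genes=["FLT3"],
    token="example-token",
)
```

The resulting URL and headers could then be passed to any HTTP client, with the response body streamed to a `.bam` file for indexing and variant calling. Batch retrieval over thousands of subjects would loop this over a manifest of file UUIDs.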