Genotype calling from Recount3 RNA-seq data



Author(s): Afrooz Razi, Christopher C. Lo, Sirou Wang, Jeffrey T. Leek, Kasper Daniel Hansen

Affiliation(s): Department of Human Genetics, Johns Hopkins University School of Medicine. Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. Biostatistics Program, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center



Advances in high-throughput sequencing technologies have enabled gene expression studies of human health and disease, and large amounts of transcriptome data have been uploaded to public repositories in the last decade. Access to existing data promotes reproducibility and reusability; however, processing raw RNA-seq data is time-consuming and resource-intensive. The Recount3 repository addresses this issue by providing public access to processed RNA-seq data. Recount3 aggregated all available human RNA-seq data and provides access to 316,443 uniformly processed bulk human RNA-seq samples from the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA), and the Sequence Read Archive (SRA). However, sample genotype information is missing in most studies. We have developed a simple yet robust statistical model for genotyping RNA-seq samples. Our model is fast and uses only raw read counts for the reference and alternative alleles, whereas previous variant callers require alignment files, which are not available in Recount3. We have genotyped all Recount3 samples, producing the largest RNA-seq repository with matched genotype information to date. Our model predicted genotypes with 98.2% accuracy in our evaluation set. Because the resource contains private genotype information, work on sharing it is ongoing. This resource can be used for large-scale allele-specific expression and eQTL analyses in future studies.
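
To illustrate what genotyping from reference and alternative allele read counts alone can look like, here is a minimal Python sketch of a generic binomial maximum-likelihood caller. It is not the model developed in this work, whose details are not given here. The binomial likelihood, the function name `call_genotype`, the `error_rate` parameter, and the assumed alternative allele fractions are all hypothetical choices made for illustration.

```python
from math import comb, log


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Pick the most likely genotype at one biallelic site from read counts.

    Illustrative assumption: alt reads follow a binomial distribution whose
    success probability depends on the genotype. This is a generic textbook
    approach, not the model described in the abstract.
    """
    total = ref_count + alt_count
    if total == 0:
        return None  # no coverage, so no call can be made

    # Assumed expected fraction of alt reads for each genotype (hypothetical).
    alt_fraction = {
        "0/0": error_rate,        # homozygous reference: alt reads are errors
        "0/1": 0.5,               # heterozygous: balanced expression assumed
        "1/1": 1.0 - error_rate,  # homozygous alternative
    }

    # Binomial log-likelihood of the observed counts under each genotype.
    log_lik = {
        genotype: log(comb(total, alt_count))
        + alt_count * log(p)
        + ref_count * log(1.0 - p)
        for genotype, p in alt_fraction.items()
    }
    return max(log_lik, key=log_lik.get)
```

Under these assumptions, a site covered by 20 reference reads and no alternative reads is called `0/0`, while a site with 10 reads for each allele is called `0/1`. A practical caller would also need to handle low coverage and allele-specific expression, which this sketch ignores.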