Calling Variants from RNA-seq data

An updated approach to variant detection in transcriptomes

The current gold standard method of choice to study transcriptome expression is RNA-seq. This is commonly used to study gene expression patterns in a number of organisms; including human, animals and plants, in order to further understand genetic mechanisms underlying phenotype determination, diseases and results of environmental changes. Not only can RNA-seq analysis be used for gene expression, transcript modelling including the identification of novel expressed isoforms or alternative exon usage has become more popular. Examination of the long non-coding RNA atlas (lncRNA) and micro RNAs (miRNA) can readily be carried out using specific RNA-seq approaches. Combined with DNA-seq SNPs have been studied with feature expression data to identify expression quantitative trait locus (eQTL) via GWAS mapping or allele-specific expression (ASE) (for example the human GTEx Consortium, 2020).

With the phenomenon of RNA editing, where a nucleotide change can be observed at the RNA level, post DNA transcription, it is possible that variants can only be observed at the RNA level, absent at DNA level (such as WGS or WES). Therefore RNA-seq can also detect genomic variants in expressed regions of the genome, in addition to DNA-seq. With the current costs of RNA-seq per sample around £300 for 75M paired end reads (150bp), RNA-seq is an attractive approach for variant discovery.

Although there are clear advantages in using RNA-seq data for coding region SNP detection, it is not commonly carried out. There are a number of hurdles to overcome:

  • transcriptome consists of mature/spliced transcripts
  • short reads frequently overlap exon-exon junctions
  • reads are commonly split using N cigars

There are distinct stages of analysis for variant calling from RNA-seq data. I will try to encompass all of them, so please skip ahead to sections you require.

Assumptions I will make: you have the raw reads processed for base scoring and quality controlled, removed adapters, a reference genome has been downloaded and indexes created for the STAR splice-aware aligner, reads have been aligned using the index with two-pass alignment using STAR.

Key notes: ensure that ReadGroups have been added to your resulting BAM files - this can either be done within the aligner command with the argument --outSAMattrRGline ID:${sample_id} CN:{SampleCentre} DS:RNAseq LB:{LibraryType} PL:{Platform} SM:${sample_id}

I am going to skip over the initial steps of quality control (QC) and alignment with STAR, as those are outlined in the RNA-seq workflow I have on github. There are however some key files that should have been generated when processing the reference genome fasta (.fa) and aligner index files. Ensure that the reference .fa is indexed correctly, I recommend samtools faidx and that a dictonary (.dict) file has been created, for example with the picard CreateSequenceDictionary tool.

Analysis:

  1. QC

  2. Alignment

  3. Split N Cigar Reads

  4. Base quality recalibration

  5. Call variants

  6. Base Quality Recalibration

-BLURB-

One of the hurdles that you may encounter using the


pwd lets go
Nicholas Owen
Nicholas Owen
Research Fellow and Bioinformatician

My interests include rare disease translational research using ‘omics analysis.