Genotyping from RNA-Seq: how to recalibrate mapping quality scores in a BAM file?
6.4 years ago
Leszek 4.1k

I have aligned RNA-Seq reads with STAR and I want to genotype the samples from those reads. I have genotyped from RNA-Seq experiments before, but those reads were aligned with tophat2. This time the dataset is too big to run tophat2...

The problem is that STAR marks unique alignments with the maximal mapping quality (mapq=255), which causes problems with genotyping tools, e.g. GATK. And the GATK developers recommend recalibrating mapq prior to genotyping anyway.

1. Can you recommend some tools for recalibrating mapq and base qualities?
2. Does anyone have experience with genotyping from RNA-Seq?
3. How does genotyping from STAR alignments compare to genotyping from tophat2 alignments?
4. And is genotyping from DNA-Seq comparable to genotyping from RNA-Seq? Should I take any special precautions?

I would be grateful for any hints!

recalibrate mapq bam STAR RNA-Seq • 2.9k views
6.4 years ago

1. The SplitNCigarReads tool is primarily meant for splitting reads with N in the CIGAR string into individual exon segments and hard-clipping any sequence that overhangs into intronic regions, but it also allows you to reassign the mapping quality. The command below is taken from the GATK website. It only changes MAPQ, though. Why would you want to reassign base qualities? If you meant base quality recalibration, BQSR in GATK can be used.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
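
To make clear what the ReassignOneMappingQuality filter does to each record, here is a minimal Python sketch that applies the same 255 -> 60 substitution to plain SAM text. It is only an illustration, not how GATK implements it; the function name `reassign_mapq` and its parameters are made up for this example.

```python
def reassign_mapq(sam_line, from_q=255, to_q=60):
    """Return a SAM line with MAPQ `from_q` replaced by `to_q`.

    Hypothetical helper for illustration only; header lines and
    records with any other MAPQ are returned unchanged.
    """
    if sam_line.startswith("@"):
        return sam_line  # header lines carry no MAPQ
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM alignment record has 11 mandatory columns; MAPQ is column 5.
    if len(fields) >= 11 and fields[4] == str(from_q):
        fields[4] = str(to_q)
    return "\t".join(fields) + newline


# Toy record with the MAPQ of 255 that STAR gives unique alignments.
record = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
print(reassign_mapq(record), end="")
```

Only records whose MAPQ is exactly 255 are touched, so multi-mapper scores stay as they are, which matches the -RMQF 255 -RMQT 60 options in the command above.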

2. Yes, I have quite a bit of experience :-). But I don't know what exactly you want me to talk about. Genotyping using RNAseq works fine, but you are interrogating less than two percent of the whole genome. In short, you may not get enough markers/genotypes to perform an association study with much precision. We perform RNAseq genotyping to check for strain and sex assignment errors, which are pretty common when you do big experiments involving 200-500 mice. Mendelian genes show clear segregation and help to rectify the strain assignment. For sex, you may use genes like Xist (female) or Ddx3y (male).
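
The sex check can be sketched in a few lines of Python, assuming you already have normalised expression values (e.g. TPM) for Xist and Ddx3y per sample. The function name `infer_sex` and the `min_expr` cutoff of 1.0 are arbitrary placeholders for this example, not values from our pipeline; pick a cutoff that suits your own data.

```python
def infer_sex(xist, ddx3y, min_expr=1.0):
    """Guess sample sex from Xist and Ddx3y expression.

    Hypothetical helper: `min_expr` is an assumed cutoff on whatever
    normalised unit the two values are in (e.g. TPM).
    """
    xist_on = xist >= min_expr
    ddx3y_on = ddx3y >= min_expr
    if xist_on and not ddx3y_on:
        return "female"
    if ddx3y_on and not xist_on:
        return "male"
    # Both expressed or both silent: flag the sample for manual review.
    return "ambiguous"


# Made-up expression values for three samples: (Xist, Ddx3y).
samples = {"s1": (85.2, 0.0), "s2": (0.3, 41.7), "s3": (0.0, 0.2)}
for name, (xist, ddx3y) in samples.items():
    print(name, infer_sex(xist, ddx3y))
```

Comparing the inferred label against the recorded sex for each sample is then enough to flag likely mix-ups.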

3. The GATK website says: "For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner". I have never compared them myself, but I believe that STAR and Tophat2 will give you more or less the same results for genotyping.

4. The methods are comparable, but for RNA-seq you should use a splice-aware aligner and also run SplitNCigarReads, which I mentioned above. You should also exclude soft-clipped bases from variant calling to minimize false-positive variants. All these steps are part of the best practices for RNAseq variant calling described by GATK here (https://www.broadinstitute.org/gatk/guide/article?id=3891), in case you haven't seen it.


Thanks a lot! That's a really helpful post!

6.4 years ago

If you want to genotype from RNA-seq data, I suggest you use BBMap, which is more sensitive to both SNPs and indels than STAR or TopHat2. Genotyping from RNA-seq is of course not the same as genotyping from DNA, as you're never assured of getting equal coverage from both alleles, or indeed any coverage from one allele. Or, indeed, any coverage from most genes.