Increased proportion of heterozygous SNPs when RNAseq reads are mapped against a transcriptome
2
1
Entering edit mode
17 months ago
thal ▴ 10

I have paired-end RNAseq data (Illumina). I have assembled a transcriptome with SPAdes. Now I want to call SNPs. There is no reference genome available yet for the species I am actually working with. That's why I thought I map the reads against the transcriptome (with bwa) to call SNPs (with bcftools mpileup) for some preliminary tests (3 samples). This yielded ~90% heterozygous SNPs after quality filtering (QUAL>30, DP>30, MQ>30, MQB>0.4, GQ>30). Since the species is a predominantly self fertilizing plant, this is way too high.

To check if something is wrong with this specific species, I did the same procedure for another selfing species for which a reference is available (same sequencing project, again 3 samples). I got ~25% heterozygous SNPs when I mapped the reads against the transcriptome and 3% when I mapped them against the reference genome (with STAR, only uniquely mapped reads (255) retained). 3% is realistic and fits to previous results, 25% is again way too high.

So, I am wondering what is going on. The only thing I can come up with is, that the two different aligners cause this difference. But in both cases 90-95% of the reads mapped to the genome/transcriptome and in both cases only reads that were mapped with high confidence were retained. Why do the two procedures produce two so different levels of heterozygousity?

I' be happy about any input, since I am stuck.

transcriptome SNP RNA-seq • 1.3k views
ADD COMMENT
0
Entering edit mode

hi i am looking forward to map SNPs from transcriptome data. refernce genome is also available. but can you point me to a beginners guide. where to start? and list of tools that i will need? i am a beginner in NGS please

ADD REPLY
0
Entering edit mode

Please do not post new questions as answers in existing threads. If you do a google search with rnaseq snp site:biostars.org you will find existing threads. If they don't provide adequate information then please post a new question.

ADD REPLY
1
Entering edit mode
17 months ago
liorglic ★ 1.4k

Interesting question, related to the common question of whether or not using a transcriptome as reference in RNA-seq analysis is a good idea.

I am not sure what's happening here, but I suspect this has to do with gene paralogs or small-scale duplications within transcripts. I am guessing that using splice-aware aligners and a genome reference allows for more accurate mapping, so reads are mapped to the correct gene rather than to a paralog.

Something similar has been reported here with regard to cryptic gene duplications. I guess using a transcriptome makes things even worse.
Maybe you can try to filter your bam files to only include perfectly-mapped reads (i.e. the full length of the read + max of 1 mismatch), then call SNPs from these filtered files and see if you still get so much heterozygosity. This will probably discard a lot of data, and also will not help if the paralogs are very similar, but still worth trying.

Another question - when assembling the transcriptome, did you use rnaspades? Because "standard" spades may not work so well for RNA-seq data.

Finally, if you have the option to produce long-read data (e.g. PacBio IsoSeq), then you'll probably have an easier time both assembling the transcriptome and calling variants.

ADD COMMENT
0
Entering edit mode

Thanks a lot for your input.

I think that the main problem is indeed cryptic duplications, as suggested by liorglic. Similar to the paper he referred to, I see tracks of heterozygousity in my data, i.e. entire transcripts with only heterozygous SNPs. In my view, this suggests that these are duplicated in the genome, but the assembler (yes, I used rnaspades) recovered only one copy and the reads from the paralog map to this one copy, which results in spurious heterozygous SNP calling. Other types of RNA (lncRNA etc as suggested by colindaven) are most likely adding to this issue, although I think paralogs missing in the transcriptome are the bigger issue.

That also means that more stringent filtering is not a great thing to do. While I can probably get rid of some of the spurious SNPs, I will still retain some of them and also lose correct ones.

I think that colindaven is right, when he said "a transcriptome is not sufficient for assessing hets". At least not a transcriptome that was created as quick and dirty as mine. Luckily, we are currently sequencing the genome, so I just have to be a more patient...

ADD REPLY
0
Entering edit mode

You can also give Trinity a try. It might do a better job resolving paralogs. Note that it also has a genome-guided mode that you can use once you have your genome assembly.

ADD REPLY
1
Entering edit mode
17 months ago

You're withholding genomic information from the aligner, and the aligner will try hard to align reads to the transcriptome, even if they are from genomic regions (think off target transcription, lncRNAs etc etc etc).

As you correctly reason, a transcriptome is not sufficient for assessing hets. Remember also that calling hets is much harder than calling homozygous SNPs.

I'm not sure just using perfectly mapped reads will help (then we won't find SNPs), or relying on mapping quality estimates from RNA-seq aligners (notoriously poor).

I think you need a genome to answer this question.

ADD COMMENT
0
Entering edit mode

You're right about the perfect mapping of course. I edited my answer accordingly.

ADD REPLY

Login before adding your answer.

Traffic: 2526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6