How to work with SNPs when coverage is low
4.4 years ago
luzglongoria ▴ 50

Hi there,

I am working with RNA-seq data from an organism (Plasmodium) that does not have a reference genome. What is readily available is the genome of a very closely related species, so I will use my reads and that reference genome to determine the SNPs and differences between the two species. The problem I am considering is that, since I am working with RNA-seq, not all genes will be expressed. In many cases a gene (like the one in the middle of the picture) will show zero SNPs simply because it is not expressed or its coverage is too low. If many such genes accumulate, it might look as if there are no SNPs in those genes, when in fact I just don't know.

[Image: Screen-Shot-2019-12-05-at-10-34-54]

What is the proper way of dealing with this issue? Should I choose a coverage threshold? And if so, how do I decide which one?

Thank you so much in advance.

RNA-Seq SNP coverage • 1.2k views

You can downsample your dataset and determine the precision and recall of variant calling by comparing the calls from each downsampled set against the calls from the full-coverage dataset.
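As a rough illustration of that comparison, here is a minimal Python sketch. It assumes you have two VCFs (one from the full dataset, one from a downsampled copy) and treats the full-coverage calls as the "truth" set. The function names `variant_keys` and `precision_recall` are made up for this example, and matching variants on (chromosome, position, ref, alt) is a simplifying assumption; it ignores genotype and any normalization of indels.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several alternate alleles separated by commas
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def precision_recall(truth, calls):
    """Precision and recall of `calls` relative to `truth` (both sets of keys)."""
    true_positives = len(truth & calls)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

With real files you would pass an open file handle for each VCF to `variant_keys` and repeat the comparison for each downsampling fraction.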

I wouldn't try to estimate the number of SNVs you miss in the low-coverage regions: lower coverage may correlate with DNA accessibility, and it is well known that there is a correlation between DNase-accessible regions and the number of SNVs observed there.

You may also want to check out this tool: https://academic.oup.com/gigascience/article/8/9/giz100/5559527


Thank you so much for your response. I have already done some analysis and obtained a .vcf file with the variant calls. Should I compare these data with a full-coverage dataset? And how can I do that?


No. In theory you should prepare a VCF from your initial dataset, then downsample the dataset (to, say, 10%), check whether you can still retrieve the same SNVs, and stop when you see that your coverage is no longer enough. That will be your limit, and you'll have to discard all regions of the initial dataset that are covered below this value. I think the procedure is described here (but I am not sure): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296149/
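One way this could be turned into a depth cutoff is sketched below in Python. It is only a sketch under several assumptions: for each SNV called in the full dataset you already know its read depth in the downsampled data and whether it was called again there, the 0.9 recall target is arbitrary, and the names `min_reliable_depth` and `classify_genes` and the status labels are hypothetical.

```python
def min_reliable_depth(sites, target_recall=0.9):
    """Smallest depth d such that full-data SNVs with depth >= d in the
    downsampled data are recovered at a rate of at least target_recall.

    sites: iterable of (depth_in_downsample, recovered) pairs, one per SNV
    called in the full dataset. Returns None if no depth meets the target.
    """
    sites = list(sites)
    for d in sorted({depth for depth, _ in sites}):
        kept = [recovered for depth, recovered in sites if depth >= d]
        if sum(kept) / len(kept) >= target_recall:
            return d
    return None


def classify_genes(gene_depth, gene_snps, min_depth):
    """Separate genes with too little coverage from genes with no SNPs called.

    gene_depth: {gene: mean depth}; gene_snps: {gene: number of SNPs called}.
    """
    status = {}
    for gene, depth in gene_depth.items():
        if depth < min_depth:
            status[gene] = "not assessable"  # coverage below the cutoff
        elif gene_snps.get(gene, 0) == 0:
            status[gene] = "no SNPs detected"
        else:
            status[gene] = "SNPs detected"
    return status
```

The point of the second function is that a gene with zero SNPs and coverage below the cutoff is reported as unknown rather than as SNP-free, which addresses the concern in the original question.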


Thank you so much. I think the paper you sent about the subSeq R package can help me :)


Isn't it P. falciparum? Have you done a de novo transcriptome assembly?


It is Plasmodium relictum lineage GRW4, an avian malaria parasite. And yes, I have done a de novo transcriptome assembly.
