Question: How to work with SNPs when coverage is low
luzglongoria (40) wrote, 11 months ago:

Hi there,

I am working with RNA-seq data from an organism (Plasmodium) that does not have a reference genome. What is readily available is the genome of a closely related species, so I will use my reads and that reference genome to determine the SNPs and differences between the two species. The problem is that, since I am working with RNA-seq, not all genes will be expressed, and in many cases some genes (like the one in the middle of the picture) will show zero SNPs simply because they are not expressed or their coverage is too low. If such genes accumulate, it might look as if those genes contain no SNPs, when in fact I just don't know.

[Image: Screen-Shot-2019-12-05-at-10-34-54]

What is the proper way of dealing with this issue? Should I choose a threshold? And if so, how do I decide which one?

Thank you so much in advance.

Tags: coverage, snp, rna-seq

You can downsample your dataset and determine the precision and recall of variant calling by comparing the downsampled calls against those from the full-coverage dataset.

I wouldn't try to approximate the number of SNVs you miss in the low-coverage regions: this lower coverage may correlate with DNA accessibility, and it is well known that there is a correlation between DNase-accessible regions and the number of SNVs observed there.

You may also want to check out this tool: https://academic.oup.com/gigascience/article/8/9/giz100/5559527
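As a rough sketch of the comparison suggested above: treat the variants called from the full dataset as the "truth" set and the variants called from a downsampled copy as the test set, then compute precision and recall. The function names and the example VCF lines below are hypothetical and only illustrate the idea; matching on (CHROM, POS, REF, ALT) is an assumption, and a real comparison may need normalized variants.

```python
from typing import Iterable, Set, Tuple

# A variant is identified here by (contig, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect variant keys from VCF text lines, skipping header and blank lines."""
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list several ALT alleles separated by commas.
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def precision_recall(truth: Set[Variant], called: Set[Variant]) -> Tuple[float, float]:
    """Precision and recall of `called` relative to `truth`."""
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up records for illustration only (not real data).
full_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr2\t300\t.\tG\tA,C\t50\tPASS\t.",
]
downsampled_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t30\tPASS\t.",
    "chr2\t300\t.\tG\tA\t30\tPASS\t.",
    "chr2\t999\t.\tT\tC\t30\tPASS\t.",
]

truth = parse_vcf_lines(full_vcf)
called = parse_vcf_lines(downsampled_vcf)
precision, recall = precision_recall(truth, called)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real data, the two line lists would be read from the .vcf files produced by running the same variant caller on the full and the downsampled reads.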

written 11 months ago by German.M.Demidov (1.9k)

Thank you so much for your response. I have already done some analysis and obtained a .vcf file with the variant calls. Should I compare this data with a full-coverage dataset? And how can I do that?

written 11 months ago by luzglongoria (40)

No. In theory you should prepare a VCF with your initial dataset, then downsample it to, say, 10%, then check whether you can still retrieve the same SNVs, and stop when you see that your coverage is no longer sufficient. That will be your limit, and you'll have to discard all regions of the initial dataset that are covered below this value. I guess it is described here (but I am not sure): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296149/
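One possible way to turn this procedure into a threshold, sketched under stated assumptions: suppose you have already measured SNV recall at several downsampled depths (relative to the calls from the full dataset) and you know the mean depth of each gene. The helper names, the 0.9 recall cutoff, and all numbers below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def minimum_reliable_depth(recall_by_depth: Dict[float, float],
                           min_recall: float = 0.9) -> Optional[float]:
    """Walk from the deepest to the shallowest tested depth and return the
    lowest depth reached before recall first drops below `min_recall`.
    Returns None if even the deepest level fails."""
    limit: Optional[float] = None
    for depth in sorted(recall_by_depth, reverse=True):
        if recall_by_depth[depth] < min_recall:
            break  # coverage is no longer enough; stop here
        limit = depth
    return limit


def split_genes_by_depth(gene_depth: Dict[str, float],
                         limit: float) -> Tuple[List[str], List[str]]:
    """Separate genes whose depth reaches `limit` from genes that should be
    reported as 'unknown' rather than 'no SNPs'."""
    callable_genes = sorted(g for g, d in gene_depth.items() if d >= limit)
    uncertain_genes = sorted(g for g, d in gene_depth.items() if d < limit)
    return callable_genes, uncertain_genes


# Illustrative, invented values: mean depth after downsampling -> recall.
recall_by_depth = {40: 0.98, 20: 0.95, 10: 0.91, 5: 0.70, 2: 0.35}
# Illustrative, invented per-gene mean depths in the full dataset.
gene_depth = {"geneA": 35.0, "geneB": 0.0, "geneC": 12.5}

limit = minimum_reliable_depth(recall_by_depth)
if limit is not None:
    callable_genes, uncertain_genes = split_genes_by_depth(gene_depth, limit)
    print("depth limit:", limit)
    print("SNP calls usable for:", callable_genes)
    print("insufficient coverage:", uncertain_genes)
```

Genes in the second list are the ones where a count of zero SNPs says nothing about the true number, so they are better excluded than counted as invariant.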

written 11 months ago by German.M.Demidov (1.9k)

Thank you so much. I think the paper you sent about the subSeq R package can help me :)

written 11 months ago by luzglongoria (40)

Is it not P. falciparum? Have you done de novo transcriptome assembly?

written 11 months ago by Kevin Blighe (67k)

It is Plasmodium relictum lineage GRW4, an avian malaria parasite. And yes, I have done de novo transcriptome assembly.

written 11 months ago by luzglongoria (40)