Question

Snp Density

0

Entering edit mode

10.3 years ago

int11ap1 ▴ 470

Hi guys,

Well, I have called variants with samtools mpileup, where:

MAPQ=20 # min map quality BASEQ=20 # min base quality

and then filtered them by:

MINCOV=5 # min coverage MAXCOV=30 # maximum coverage

Now, I have to look for the SNP density within the region sequenced. For that, I obtain the depth file (samtools depth).

awk '{nucleotides++} END {print nucleotides}' depthfile.txt
(in order to know the real length)

It should be as simple as divide the number of SNP by the length. However, doing that I would obtain a wrong number of SNP/kb, because I've filtered SNPs by quality and coverage.

So, am I correct if I should filter the depth file by coverage? Quality as well? Or not?

Thanks a lot, guys.

snp samtools coverage • 4.7k views

ADD COMMENT • link updated 10.3 years ago by DG 7.3k • written 10.3 years ago by int11ap1 ▴ 470

0

Entering edit mode

cross posted in seqnaswers

ADD REPLY • link 9.5 years ago by Prakki Rama ★ 2.7k

score 0 · Answer 1 · 2014-01-17

0

Entering edit mode

10.3 years ago

DG 7.3k

Well, it depends a little bit on the question you are asking. But I would think you are asking the question of how many REAL SNPs do I see in my data per kb. If that is the case then absolutely you want to filter by depth and quality first. Those hard filtering approaches are meant to try and remove false-positives from your dataset. Otherwise you are essentially asking the question of the density of SNPs called by the variant caller per kb. Since that number includes false positives it is a less interesting and informative question.

ADD COMMENT • link 10.3 years ago by DG 7.3k

0

Entering edit mode

Thanks for answering. Sure, my SNPs are filtered by quality and coverage. However, I call the command awk to the depth file in order to count how many bases exactly I have (length). So, you think I should count only the bases above 5 of coverage and below 30, such as I've done with SNPs?

ADD REPLY • link 10.3 years ago by int11ap1 ▴ 470

0

Entering edit mode

I'm taking that to mean you are looking at some sort of targeted re-sequencing data then, either Exome or something like Amplicon sequencing, targeted panel, etc?

Part of this really depends on what you are sequencing. I normally would probably have just gone with the length of my targeted regions for sequencing, although it would skew your density if you had targeted regions that were not sequenced adequately. Otherwise yes, if you want to count bases where you had proper sequence coverage only I would use the same Depth/Quality thresholds for filtering as what you used for filtering SNPs.

ADD REPLY • link 10.3 years ago by DG 7.3k

0

Entering edit mode

I agree with Dan.

If you are doing "Exome or something .." your SNPs/kb will be fairly arbitrary, as your fragments may not be representative of the whole chromosome/genome, therefore not useful for external reference. However as long as you are comparing consistently between your own samples, I don't think it really matters for what you use as your denominator.

ADD REPLY • link 10.3 years ago by c.v.oflynn ▴ 100