Question: SNP at every chromosome position
0
gravatar for waqasnayab
3.5 years ago by
waqasnayab180
Pakistan
waqasnayab180 wrote:

Hi,

I have SOLiD sequencing data. After aligning with SHRiMP2, I used samtools mpileup for SNP calling: samtools mpileup -C50 -gf hg38.fa -o var.raw.bcf input.bam bcftools call -o var.raw.vcf -O v -c var.raw.bcf The raw vcf file looks like this:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SINDHI

chr1 33953 . T . 56.8087 PASS DP=11;MQSB=0.950952;MQ0F=1;AF1=0;AC1=0;DP4=6,5,0,0;MQ=0;FQ=-59.9998 GT:PL 0/0:0

chr1 33954 . C . 56.8087 PASS DP=11;MQSB=0.950952;MQ0F=1;AF1=0;AC1=0;DP4=6,5,0,0;MQ=0;FQ=-59.9998 GT:PL 0/0:0

chr1 33955 . T . 56.4609 PASS DP=10;MQSB=0.952347;MQ0F=1;AF1=0;AC1=0;DP4=5,5,0,0;MQ=0;FQ=-56.997 GT:PL 0/0:0

chr1 35396 . C . 56.4609 PASS DP=10;MQSB=0.952347;MQ0F=1;AF1=0;AC1=0;DP4=5,5,0,0;MQ=0;FQ=-56.997 GT:PL 0/0:0

chr1 35397 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0

chr1 35398 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0

chr1 35399 . A . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0

chr1 35400 . A . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0

chr1 35401 . C . 61.2368 PASS DP=12;MQSB=0.95494;MQ0F=1;AF1=0;AC1=0;DP4=5,7,0,0;MQ=0;FQ=-62.9905 GT:PL 0/0:0 As

you can see, the chromosome position is continuous. I know that these are raw variant file but it contains 333,399,862 variants (almost as the number of bases). So, how can I filter this (there are so many false positives), i need filtration or should i do something at mpileup or bcftools call stage?

myposts • 1.2k views
ADD COMMENTlink modified 3.5 years ago by Manuel Landesfeind1.2k • written 3.5 years ago by waqasnayab180
2
gravatar for Manuel Landesfeind
3.5 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:

In the way you executed samtools/bcftools, it will print out every position with a pileup, i.e., at least one mapped read. To print only variants where there is some evidence for a variant use the '-v' command line parameter:

$ bcftools call
[...]
Input/output options:
[...]
  -v, --variants-only             output variant sites only
[...]

You can also see that you have actually no variants predicted as the column for alternative sequence (ALT) is empty (".") and all mapped reads at that position support the reference sequence (e.g., INFO column DP4="6,5,0,0" does mean 6 reads on the forward and 5 reads on the reverse strand support the reference while none support the/an alternative).
 

Important: Still there will be a large number of false positive calls, so you will have to use some filtering procedure, e.g., removing variants with low coverage or low allele frequency. For this, have a look at `bcftools filter` and select suitable thresholds. I think, there is no ground-truth for suitable thresholds and people use different setting according on their needs. Probably, you find some paper doing a similar analysis or on a similar organism or similar study or whatsoever and you can adopt their setting to be comparable.

 

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Manuel Landesfeind1.2k

Like! Its great anwer. thx

ADD REPLYlink written 3.5 years ago by pshopich0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1035 users visited in the last hour