strand bias filtering with bcftools
1
0
Entering edit mode
7.1 years ago
user31888 ▴ 130

I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).

#CALLING
samtools mpileup \
--fasta-ref my_ref.fasta \
--min-MQ 1\
--min-BQ 30 \
--uncompressed \
--output-tags AD,INFO/AD,DP,SP \
--skip-indels \
input.bam \
| \
bcftools call \
--consensus-caller \
--variants-only \
--skip-variants indels \
--output-type v \
--output call.vcf

#FILTERING
bcftools filter \
--include INFO/PV4[0] < 0.01 \
--mode x \
--output call_filtered.vcf
  • Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?

According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4[0] is the "P-values for strand bias".

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

Are PV4[0] and SP both calculated with Fischer's exact test?

  • You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?

bcftools samtools Phred • 5.6k views
ADD COMMENT
0
Entering edit mode
6.2 years ago

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

In bioinformatics, never assume anything without concrete evidence. So, I would not make this assumption without seeing the exact calculations that are being performed in order to generate both of these metrics.

I do believe that Fisher's test is being used, but I cannot confirm. Your better option may be to look in the manual for samtools or to contact Heng Li. who developed BWA and SAMtools / BCFtools.

--------------------------------------------

You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

These statements do not contradict each other. By keeping variants that are < Phred-scaled P value 60, only variants showing low evidence of strand bias are retained. The other statement is then just saying that a higher Phred-scaled P value score indicates that the "decision" that strand bias exists is indeed correct.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 1483 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6