Question: strand bias filtering with bcftools
0
gravatar for user31888
2.0 years ago by
user3188840
United States
user3188840 wrote:

I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).

#CALLING
samtools mpileup \
--fasta-ref my_ref.fasta \
--min-MQ 1\
--min-BQ 30 \
--uncompressed \
--output-tags AD,INFO/AD,DP,SP \
--skip-indels \
input.bam \
| \
bcftools call \
--consensus-caller \
--variants-only \
--skip-variants indels \
--output-type v \
--output call.vcf

#FILTERING
bcftools filter \
--include INFO/PV4[0] < 0.01 \
--mode x \
--output call_filtered.vcf
  • Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?

According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4[0] is the "P-values for strand bias".

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

Are PV4[0] and SP both calculated with Fischer's exact test?

  • You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?

phred samtools bcftools • 1.7k views
ADD COMMENTlink modified 14 months ago by Kevin Blighe41k • written 2.0 years ago by user3188840
0
gravatar for Kevin Blighe
14 months ago by
Kevin Blighe41k
Kevin Blighe41k wrote:

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

In bioinformatics, never assume anything without concrete evidence. So, I would not make this assumption without seeing the exact calculations that are being performed in order to generate both of these metrics.

I do believe that Fisher's test is being used, but I cannot confirm. Your better option may be to look in the manual for samtools or to contact Heng Li. who developed BWA and SAMtools / BCFtools.

--------------------------------------------

You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

These statements do not contradict each other. By keeping variants that are < Phred-scaled P value 60, only variants showing low evidence of strand bias are retained. The other statement is then just saying that a higher Phred-scaled P value score indicates that the "decision" that strand bias exists is indeed correct.

Kevin

ADD COMMENTlink written 14 months ago by Kevin Blighe41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1513 users visited in the last hour