strand bias filtering with bcftools
1
0
Entering edit mode
4.8 years ago
user31888 ▴ 100

I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).

#CALLING
samtools mpileup \
--fasta-ref my_ref.fasta \
--min-MQ 1\
--min-BQ 30 \
--uncompressed \
--skip-indels \
input.bam \
| \
bcftools call \
--consensus-caller \
--variants-only \
--skip-variants indels \
--output-type v \
--output call.vcf

#FILTERING
bcftools filter \
--include INFO/PV4[0] < 0.01 \
--mode x \
--output call_filtered.vcf

• Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?

According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4[0] is the "P-values for strand bias".

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

Are PV4[0] and SP both calculated with Fischer's exact test?

• You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?

bcftools samtools Phred • 3.8k views
0
Entering edit mode
4.0 years ago

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

In bioinformatics, never assume anything without concrete evidence. So, I would not make this assumption without seeing the exact calculations that are being performed in order to generate both of these metrics.

I do believe that Fisher's test is being used, but I cannot confirm. Your better option may be to look in the manual for samtools or to contact Heng Li. who developed BWA and SAMtools / BCFtools.

## --------------------------------------------

You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

These statements do not contradict each other. By keeping variants that are < Phred-scaled P value 60, only variants showing low evidence of strand bias are retained. The other statement is then just saying that a higher Phred-scaled P value score indicates that the "decision" that strand bias exists is indeed correct.

Kevin