I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).
#CALLING samtools mpileup \ --fasta-ref my_ref.fasta \ --min-MQ 1\ --min-BQ 30 \ --uncompressed \ --output-tags AD,INFO/AD,DP,SP \ --skip-indels \ input.bam \ | \ bcftools call \ --consensus-caller \ --variants-only \ --skip-variants indels \ --output-type v \ --output call.vcf #FILTERING bcftools filter \ --include INFO/PV4 < 0.01 \ --mode x \ --output call_filtered.vcf
- Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?
According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4 is the "P-values for strand bias".
Would it mean that
PV4 < 0.01 is equivalent to
SP > 20 (Q = -10*log10(0.01) = 20)?
Are PV4 and SP both calculated with Fischer's exact test?
- You can also read here that GATK keeps SNPs with
strand bias p-value < 60for filtering SNP (equivalent to
p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.
In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?