Question

Variant calling pacbio bacteria

0

Entering edit mode

4.7 years ago

firestar ★ 1.6k

I have PacBio reads from three bacterial genomes as three samples. They were de-novo assembled using HGAP4 and polished. One of samples was taken as reference. And subreads from each sample was mapped back to this reference for variant calling. Variant calling was performed using two workflows:

subreads -> ngmlr -> Sniffles -> VCF
subreads -> pbmm2 -> pbsv -> VCF

Both VCF files have no DP, MQ or any of the standard quality metrics. Is this normal?

How can we access the quality of these variants? Visual examination seems tricky due to huge number of indels in the BAM file.
Another point is that I couldn't find any ploidy option in any of these tools. Is that not important? I am guessing these tools are developed for diploids.

Here is the PBSV vcf (696 rows in total)

pb_501_001  8   pbsv.CNV.12 T   <CNV>   .   NearContigEnd   SVTYPE=cnv;END=132000;SVLEN=131992;SHADOWED CN  3
pb_501_001  8   pbsv.BND.pb_501_001:8-pb_501_001:5167147    T   ]pb_501_001:5167147]T   .   NearContigEnd   SVTYPE=BND;CIPOS=-7,95;MATEID=pbsv.BND.pb_501_001:5167147-pb_501_001:8;MATEDIST=5167139 GT:AD:DP    1/1:0,343:343

Here is the VCF from Sniffles (891 rows in total)

pb_501_001  1   0   N   <DUP>   .   PASS    SUPP=3;SUPP_VEC=111;SVLEN=5167155;SVTYPE=DUP;SVMETHOD=SURVIVOR1.0.7;CHR2=pb_501_001;END=5167156;CIPOS=0,0;CIEND=0,0;STRANDS=-+  GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 1/1:NA:5167155:0,136:-+:.:DUP:0:NA:NA:pb_501_001_1-pb_501_001_5167156   1/1:NA:5167155:0,219:-+:.:DUP:0:NA:NA:pb_501_001_1-pb_501_001_5167156   1/1:NA:5167155:0,100:-+:.:DUP:0:NA:NA:pb_501_001_1-pb_501_001_5167156
pb_501_001  1612    1   N   <INS>   .   PASS    SUPP=3;SUPP_VEC=111;SVLEN=321;SVTYPE=INS;SVMETHOD=SURVIVOR1.0.7;CHR2=pb_501_001;END=1612;CIPOS=0,0;CIEND=0,0;STRANDS=+- GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/0:NA:321:24,2:+-:.:INS:1:NA:NA:pb_501_001_1612-pb_501_001_1612    0/0:NA:321:48,2:+-:.:INS:1:NA:NA:pb_501_001_1612-pb_501_001_16120/0:NA:321:19,3:+-:.:INS:1:NA:NA:pb_501_001_1612-pb_501_001_1612

These are the types of variants in the sniffles VCF.

cat sniffles.vcf | grep -v "^#" | cut -f5 | sort | uniq -c
    101 <DEL>
     86 <DUP>
     12 <DUP/INS>
    340 <INS>
    162 <INV>
    154 <INVDUP>
     36 <INV/INVDUP>

I had to use subreads because I wasn't able to generate CCS from this data. Are there any other tools to call variants (mostly SNPs) from bacterial pacbio data? I tried PBHoover, but unfortunately it just crashes when I run it.

pacbio variant-calling bacteria sniffles pbsv • 1.9k views

ADD COMMENT • link updated 4.6 years ago by William Rowell ▴ 280 • written 4.7 years ago by firestar ★ 1.6k

score 0 · Answer 1 · 2019-09-20

For pbsv, DP is reported for all variant types except for CNV. You can see DP in the FORMAT field for the example you provided:

pb_501_001  8   pbsv.BND.pb_501_001:8-pb_501_001:5167147    T   ]pb_501_001:5167147]T   .   NearContigEnd   SVTYPE=BND;CIPOS=-7,95;MATEID=pbsv.BND.pb_501_001:5167147-pb_501_001:8;MATEDIST=5167139 GT:AD:DP    1/1:0,343:343

While pbsv is designed for diploid samples, you can identify variants in haploid organisms by either post filtering for GT 1/1, or by increasing the percent of supporting reads in pbsv call.

  -P,--call-min-read-perc-one-sample          Ignore calls supported by < P% of reads in every sample. [20]

You may need to experiment a bit here, but setting this to 80% or higher should give you the results you expect without sacrificing sensitivity.