Freebayes Very Huge Snp Report File Size
2
0
Entering edit mode
10.2 years ago
Jirapong ▴ 20

I'm trying to run FreeBayes with Human genomic sample. Here is my command.

freebayes --no-indels --no-mnps --no-complex  -v snp0.vcf -p 2 -f Homo_sapiens.nucleotide.fa -b sample1.bam


BAM file as SAM format (output from BWA)

HWI-ST313_0162:7:46:20191:40694#CTTGTA    177    Y    2786195    0    100M    6    73049336    0    CATAGTATTCCATGGTGTATATGTGCCACATTTTCTTCATCCAGTCTATCATTGNTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTGAATAGTGC    ggffedccb_fcadefdcfdfceaedSddehdafgfggdgcggddddfd]]]]BbT]bgaggbggggggeggggegggggggggffgggfgggggggg    XT:A:R    NM:i:1    SM:i:0    AM:i:0    X0:i:38    XM:i:1    XO:i:0    XG:i:0    MD:Z:54T45
HWI-ST313_0162:7:63:14158:88065#CTTGTA    113    Y    2925844    0    100M    =    2925844    0    AAAGGAGGCATCTCAAAGGAAATGGAATTTAGTTGAGCTGAAGGATAAGAATTAGATTGCACTGTATTAAAAGTTGGTGAAGGGCTTCCCAGGCAAAGAA    gdgggcedegfacfcggggggcgggefecffggggfdgeggfgggegggggggffgeeggcgggeggfggggggffgffegfgfggbggggggggggfgf    XT:A:R    NM:i:71    SM:i:0    AM:i:0    X0:i:2    X1:i:0    XM:i:1    XO:i:0    XG:i:0    MD:Z:0T1T1T0T1C2C0A0G0A1C0T0T0T0T0G0C0A0A0A0T0C1G2T1A0T2A0A0C0T0T0A0A0C0A0C1T0T0G1A0G0T0T0A1T0T0G0T0A1A0C0G2T0T0T2C0T0C1T0T0C0A1A0T1G0A0G1T0A0C2G3G0    XA:Z:X,-88657279,100M,1;


When result generated, it got all position as SNP.

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    unknown
Y    3044167    .    G    T    14.6236    .    AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.0103;HWE=-0;LEN=1;MEANALT=2;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=0;PAIRED=0;PAIREDR=0;RO=2;RPP=5.18177;RPPR=3.0103;RUN=1;SAP=5.18177;SRP=7.35324;TYPE=snp;XAI=0;XAM=0.81;XAS=0.81;XRI=0;XRM=0.4;XRS=0.4;BVAR    GT:DP:RO:QR:AO:QA:GL0/1:4:2:90:1:33:-6.27,-3.72597,-11.48
Y    3044168    .    G    T    13.2124    .    AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.79;XAS=0.79;XRI=0;XRM=0.276667;XRS=0.276667;BVAR    GT:DP:RO:QR:AO:QA:GL    0/1:4:3:133:1:33:-3.3,-0.60206,-12.4133
Y    3044169    .    T    G    13.2124    .    AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.79;XAS=0.79;XRI=0;XRM=0.276667;XRS=0.276667;BVAR    GT:DP:RO:QR:AO:QA:GL    0/1:4:3:135:1:33:-3.3,-0.60206,-12.6
Y    3044170    .    A    T    13.2124    .    AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.81;XAS=0.81;XRI=0;XRM=0.27;XRS=0.27;BVAR    GT:DP:RO:QR:AO:QA:GL    0/1:4:3:162:1:33:-3.3,-0.60206,-15.12


and final file result like > 100 GB (my BAM file is 8 GB). Any advice?

snp • 3.7k views
0
Entering edit mode

turn out if i use with ploidy more than one it will generate alot result, to at this point i limited to use only Haploid.

2
Entering edit mode
10.2 years ago

Your alignments are generating a very large number of changes see:

MD:Z:0T1T1T0T1C2C0A0G0A1C0T0T0T0T0G0C0A0A0A0T0C1G2T1A0T2A0A0C0T0T0A0A0C...

For each of these variations in the CIGAR string you may end up with a separate line in your output. So a single line of your SAM file may generate dozens of lines of SNP calls. Now it is very likely that either your alignment or the way you invoke freebayes is not correct. Right now I am only trying to come up with an explanation of why the resulting file size is so large.

Also note that a BAM file is both binary and compressed, whereas your VCF file is text format and not compressed.

0
Entering edit mode

Thank you for the answer, well, you're right that if more SNP then VCF file is expect to be large. BUT when i do analysis with samtools (mpileup), we have only 151K SNP.

0
Entering edit mode

I would guess that the tool that you are running is not working properly and calls every single sequencing error a SNP

0
Entering edit mode
9.9 years ago
Erik Garrison ★ 2.3k

A basic way of resolving the problem you've experienced is to set the --min-alternate-count (-C) to 2, as this requires two supporting observations to call a variant. The most recent version in github sets this by default to avoid exactly the problem you're experiencing (low coverage and a lot of errors).

Also, what happens when you just run with the default parameters? FreeBayes uses putative complex alleles to filter clusters of errors from reads, so if you remove them you don't get this performance benefit.