Question: Why Is The Variant File Produced By Samtools So Large? It Is About ~34G.
gravatar for Jordan
7.2 years ago by
Jordan1.2k wrote:


I have a bam file of size 62G. When I do the variant analysis by samtools, it gives a huge bcf file like 34G. It's quite unusual. It's not even a vcf file. It's a compressed vcf file.

I'm not sure if I'm doing it right. I used the following command:

samtools mpileup -uf ~/refs/human_g1k_v37.fasta normal.bam | bcftools view -bvcg - > normal.raw.bcf

When I looked at this bcf file, I found something rather strange.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Unknown
chr1    10114   .       N       T       5.45    .       DP=38;VDB=2.588829e-01;AF1=1;AC1=2;DP4=0,0,11,15;MQ=0;FQ=-105   GT:PL:GQ        1/1:37,78,0:45
chr1    10115   .       N       A       4.76    .       DP=26;VDB=5.001104e-03;AF1=1;AC1=2;DP4=0,0,8,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:36,72,0:42
chr1    10116   .       N       A       9.51    .       DP=26;VDB=1.725300e-02;AF1=1;AC1=2;DP4=0,0,7,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:42,72,0:60
chr1    10117   .       N       C       8.64    .       DP=25;VDB=8.678101e-02;AF1=1;AC1=2;DP4=0,0,6,16;MQ=0;FQ=-93     GT:PL:GQ        1/1:41,66,0:57
chr1    10118   .       N       C       9.51    .       DP=25;VDB=1.464226e-01;AF1=1;AC1=2;DP4=0,0,6,17;MQ=0;FQ=-96     GT:PL:GQ        1/1:42,69,0:60

As you can see all the REF alleles are labeled as N. I'm not sure why it shows that. Can anyone help?

ADD COMMENTlink modified 7.2 years ago by Pierre Lindenbaum130k • written 7.2 years ago by Jordan1.2k

Actually I think I figured out why it's so large. It's recognizing N's as nucleotides in Reference file and hence producing each allele as a variant. So basically each and every position is being labeled as an variant. Hence, the huge file size. But I'm not sure why this is occuring though.

ADD REPLYlink written 7.2 years ago by Jordan1.2k
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

human_g1k_v37.fasta looks like the reference of the 1KG project: chromosomes doesn't have the 'chr' prefix, while your bam seem to have been mapped on the UCSC reference (with 'chr' prefix).

ADD COMMENTlink written 7.2 years ago by Pierre Lindenbaum130k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1556 users visited in the last hour