Entering edit mode
                    12.3 years ago
        Jordan
        
    
        ★
    
    1.3k
    Hi,
I have a bam file of size 62G. When I do the variant analysis by samtools, it gives a huge bcf file like 34G. It's quite unusual. It's not even a vcf file. It's a compressed vcf file.
I'm not sure if I'm doing it right. I used the following command:
samtools mpileup -uf ~/refs/human_g1k_v37.fasta normal.bam | bcftools view -bvcg - > normal.raw.bcf
When I looked at this bcf file, I found something rather strange.
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Unknown
chr1    10114   .       N       T       5.45    .       DP=38;VDB=2.588829e-01;AF1=1;AC1=2;DP4=0,0,11,15;MQ=0;FQ=-105   GT:PL:GQ        1/1:37,78,0:45
chr1    10115   .       N       A       4.76    .       DP=26;VDB=5.001104e-03;AF1=1;AC1=2;DP4=0,0,8,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:36,72,0:42
chr1    10116   .       N       A       9.51    .       DP=26;VDB=1.725300e-02;AF1=1;AC1=2;DP4=0,0,7,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:42,72,0:60
chr1    10117   .       N       C       8.64    .       DP=25;VDB=8.678101e-02;AF1=1;AC1=2;DP4=0,0,6,16;MQ=0;FQ=-93     GT:PL:GQ        1/1:41,66,0:57
chr1    10118   .       N       C       9.51    .       DP=25;VDB=1.464226e-01;AF1=1;AC1=2;DP4=0,0,6,17;MQ=0;FQ=-96     GT:PL:GQ        1/1:42,69,0:60
As you can see all the REF alleles are labeled as N. I'm not sure why it shows that. Can anyone help?
Actually I think I figured out why it's so large. It's recognizing N's as nucleotides in Reference file and hence producing each allele as a variant. So basically each and every position is being labeled as an variant. Hence, the huge file size. But I'm not sure why this is occuring though.