Question: GATK HaplotypeCaller gvcfs smaller for male samples
22 months ago, tracyc wrote:

Hi all,

I am using GATK's HaplotypeCaller to create GVCFs for some animal samples, but I am finding that the per-autosome GVCF files tend to be smaller for male samples than for female samples. The input FASTQ sizes are comparable between male and female samples, as is the number of reads aligned to each autosome (I aligned with BWA-MEM and checked the mapped read counts with SAMtools idxstats). The command I am running:

        java -Xmx4g -jar /usr/local/gatk/3.7.0/GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R $ref \
        --dbsnp $dbsnp \
        -I ${in}/${sample}.final.bam \
        --emitRefConfidence GVCF \
        -L $sequence \
        -o ${out}/${sample}.${sequence}.g.vcf

The file sizes for chrM and chrUn are comparable between male and female samples, which makes this even stranger.

Any ideas? Thanks.

snp • 550 views
modified 22 months ago • written 22 months ago by tracyc

How many samples do you have supporting this observation? Is there a difference in line count (wc -l) between the GVCFs?
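
A minimal sketch of that line-count check, assuming uncompressed GVCFs named as in the question (`${sample}.${sequence}.g.vcf`). The function name `count_records` is made up for illustration. It skips header lines so that only records are compared:

```shell
# Print "<file><TAB><number of non-header lines>" for each GVCF given.
count_records() {
    for f in "$@"; do
        # grep -vc '^#' counts the lines that do NOT start with '#'
        printf '%s\t%s\n' "$f" "$(grep -vc '^#' "$f")"
    done
}
```

For example, `count_records male1.chr1.g.vcf female1.chr1.g.vcf` prints one count per file.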

written 22 months ago by WouterDeCoster

Thanks for your reply. It occurs for 4 males and 4 females. The raw FASTQs are similar in size, as are the BAM files. There is a difference in the line count of the GVCFs: for example, male1.chr1.g.vcf has 16317909 lines, while female1.chr1.g.vcf has 32117004 lines. I looked into each file and compared them. It looks like GATK finds fewer "active sites" in the male, and hence there are fewer lines (but each position is still being accounted for, as far as I can see). I am not sure why this would happen. The samples are ovine. I have run the same script on different species and this has not happened before.
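
One way to check the "fewer lines, same positions" observation is to count records and sum the span of each one. In a GVCF, a reference block carries an `END=` key in the INFO column, so fewer but longer blocks give fewer lines over the same region. The sketch below is only an approximation: the function name `gvcf_summary` is hypothetical, it expects an uncompressed file, and it counts any record without `END=` as a single position, which undercounts deletions.

```shell
# Summarise one GVCF: total records, reference blocks (records with END=),
# and the number of positions those records span.
gvcf_summary() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            records++
            stop = $2                      # default: record covers POS only
            n = split($8, info, ";")       # INFO is column 8
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^END=/) {   # reference block: span is POS..END
                    stop = substr(info[i], 5)
                    blocks++
                }
            covered += stop - $2 + 1
        }
        END { printf "records=%d blocks=%d positions=%d\n", records, blocks, covered }
    ' "$1"
}
```

Running `gvcf_summary male1.chr1.g.vcf` and `gvcf_summary female1.chr1.g.vcf` side by side would show whether the two files span a similar number of positions despite the different record counts.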

written 22 months ago by tracyc