6.1 years ago
tracyc
Hi all,
I am using GATK's HaplotypeCaller to create GVCFs for some animal samples, but I am finding that the GVCF files for male autosomes tend to be smaller than those for female autosomes. The input fastq sizes are comparable between male and female samples, as is the number of reads successfully aligned per autosome (I aligned with BWA-MEM and checked the number of mapped reads with SAMtools idxstats).
java -Xmx4g -jar /usr/local/gatk/3.7.0/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $ref \
--dbsnp $dbsnp \
-I ${in}/${sample}.final.bam \
--emitRefConfidence GVCF \
-L $sequence \
-o ${out}/${sample}.${sequence}.g.vcf
The sizes for chrM and chrUn are comparable between males and females, which makes this even stranger.
Any ideas? Thanks.
How many samples do you have supporting this observation? Is there a difference in line count (wc -l) for the gvcfs?
Thanks for your reply. It occurs for 4 males and 4 females. The raw fastqs are similar in size, as are the bam files. There is a difference in the line count for the gvcfs, e.g. male1.chr1.g.vcf has 16317909 lines and female1.chr1.g.vcf has 32117004 lines. I had a look into each and compared the files. It looks like GATK thinks there are fewer "active sites" in the male, and hence there are fewer lines (but each position is still being accounted for, as far as I can see). I am not sure why this would happen. The samples are ovine. I have run the same script on different species and this has not happened before.
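One way to go beyond wc -l and check that every position is still accounted for is to split the records into reference blocks and candidate variant records and add up the positions they span. The following is a minimal sketch, not a validated tool. It assumes an uncompressed, single-sample GVCF in which reference blocks have <NON_REF> as their only ALT allele and carry an END= tag in the INFO column. The function name gvcf_summary is made up for this example, and the covered-position count is approximate because records without END= (including deletions) are counted as a single position.

```shell
# Summarise one GVCF: total records, reference blocks, candidate variant
# records, and the approximate number of reference positions covered.
# Assumption: column 5 is ALT and column 8 is INFO, as in standard VCF.
gvcf_summary() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            records++
            end = $2                           # no END= tag: count one position
            if (match($8, /(^|;)END=[0-9]+/)) {
                s = substr($8, RSTART, RLENGTH)
                sub(/.*END=/, "", s)           # keep only the numeric END value
                end = s
            }
            covered += end - $2 + 1
            if ($5 == "<NON_REF>") blocks++    # only ALT is <NON_REF>: reference block
            else variants++                    # any other ALT: candidate variant record
        }
        END {
            printf "records=%d ref_blocks=%d variant_records=%d positions_covered=%d\n",
                   records, blocks, variants, covered
        }
    ' "$1"
}
```

Running gvcf_summary male1.chr1.g.vcf and gvcf_summary female1.chr1.g.vcf side by side shows whether the missing lines are reference blocks or variant records. If positions_covered is similar for both files while ref_blocks differs, the male file is describing the same span with fewer, longer blocks rather than dropping positions.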