Missing samples in the output vcf file created using GenotypeGVCFs in GATK
16 months ago
kk.mahsa ▴ 140

Hi everyone

I used the following method to create a VCF file with 50 samples.

For each sample

 java -jar gatk_3.7-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I input.bam -o output.g.vcf.gz -ERC GVCF

and then for all samples

java -Xmx64G -jar gatk_3.7-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Ref.fasta -V output1.g.vcf.gz -V output2.g.vcf.gz ... -V output50.g.vcf.gz -o 50_samples.vcf.gz

But my final VCF file has 45 samples instead of 50 samples. I don't get any error messages.

Can anyone help me to solve this problem?


Check that you have 50 distinct samples. What is the output of:

for F in output*.g.vcf.gz ; do bcftools query -l "${F}" ; done | sort | uniq | cat -n
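
A variant of the check above that also reports which files share a name may make the culprit easier to spot. This is only a minimal sketch: the helper names `vcf_samples` and `report_duplicates` are made up for illustration, and it reads the `#CHROM` header line directly with `gzip -dc` and `awk` instead of calling `bcftools query -l`, on the assumption that the gVCFs are gzip/bgzip-compressed.

```shell
#!/usr/bin/env bash

# Print "sample<TAB>file" for every sample column of a gzip/bgzip-compressed VCF.
# (Hypothetical helper; sample names are columns 10 onward of the #CHROM line.)
vcf_samples() {
    local f="$1"
    gzip -dc "$f" | awk -v f="$f" -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i "\t" f; exit }'
}

# Report every sample name that appears in more than one of the given files,
# one line per name: "name: file1 file2 ...".
report_duplicates() {
    local f
    for f in "$@"; do vcf_samples "$f"; done |
        sort |
        awk -F'\t' '
            { n[$1]++; files[$1] = files[$1] " " $2 }
            END { for (s in n) if (n[s] > 1) print s ":" files[s] }' |
        sort
}

# Example call (file pattern taken from the question):
#   report_duplicates output*.g.vcf.gz
```

If this prints nothing, every file carries its own sample name; otherwise each line names a sample and the files that share it.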

I ran your command on the 15 gVCF files (Sam1 to Sam15) that the missing samples belong to, and the output was:

1 sam1
2 sam10
3 sam11
4 sam12
5 sam13
6 sam14
7 sam15
8 sam2
9 sam3
10 sam7
11 sam8
12 sam9

What does that mean?


"What does that mean?"

It means you only have 12 distinct sample names across those 15 gVCF files, so some files share a name. My guess is that the original BAMs have the same sample names. Check with:

cat list.of.path.to.bam | samtools samples
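
If the installed samtools does not provide the `samples` subcommand, the same information can be read from the `@RG` lines of each BAM header, where the `SM` tag holds the sample name. The sketch below assumes a hypothetical helper name, `rg_samples`, and a placeholder file name in the example call; it only parses header text passed on stdin.

```shell
#!/usr/bin/env bash

# Read a SAM header on stdin and print "read-group-ID<TAB>sample"
# for every @RG line (hypothetical helper, not part of samtools).
rg_samples() {
    awk -F'\t' '
        /^@RG/ {
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            print id "\t" sm
        }'
}

# Example (requires samtools; input.bam is a placeholder):
#   samtools view -H input.bam | rg_samples
```

Two BAM files that print the same sample name here would be expected to produce gVCFs with the same sample column.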

Thanks, Pierre, for your help. I checked the BAM files and realized that I had made a mistake when adding the sample names: some BAM files were given a duplicate name.
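
For anyone hitting the same problem, one possible way to correct a wrong name is to rewrite the `SM` tag in the BAM header and then regenerate the gVCF for that sample. This is a sketch under assumptions: `set_rg_sample` is a made-up helper, the file names are placeholders, and it sets the same `SM` value on every `@RG` line, which only suits BAM files that hold a single sample.

```shell
#!/usr/bin/env bash

# Read a SAM header on stdin and replace the SM value of every @RG line
# with the name given as $1 (hypothetical helper). Other lines pass through.
set_rg_sample() {
    awk -F'\t' -v OFS='\t' -v sm="$1" '
        /^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) $i = "SM:" sm }
        { print }'
}

# Example (requires samtools; file and sample names are placeholders):
#   samtools view -H input.bam | set_rg_sample sam4 > new_header.sam
#   samtools reheader new_header.sam input.bam > renamed.bam
```

Editing the header is enough because the sample name is stored only on the `@RG` header line; the reads refer to the read-group ID, not to the sample name.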
