Question: bcftools error duplicate samples
9 weeks ago, abyousaf wrote:

I am trying to merge VCF files across chromosomes 1-22 with bcftools v1.9. The command I am using is:

bcftools merge 'myfile1.vcf.gz' 'myfile2.vcf.gz' etc.... 'myfile22.vcf.gz' -o myfile1_22.vcf.gz

However I get the following error: "Error: Duplicate sample names (1310229_1310229), use --force-samples to proceed anyway."

I'm afraid to use --force-samples because I don't understand how it will affect the merged VCF file or how many duplicates there are. The data are from the UK Biobank and the VCF files are massive (1.3 TB in total across chromosomes).

Any suggestions to actually solve the error rather than use --force-samples?

NOTE: I am VERY VERY new to biostatistical analysis. I would greatly appreciate your advice, especially if it is structured for a beginner.

snp R gene • 192 views

I checked the headers and found that the first sample is 1310229. I think that if I use --force-samples, it will prepend something to every single sample name, but I don't know why. Any ideas?

(comment written 9 weeks ago by abyousaf)
9 weeks ago, Medhat wrote:

It will not cause any problems.
Using --force-samples resolves duplicate sample names by prepending the index of the file, in the order it appears on the command line, to the conflicting sample name.
This is what happens to sample S3 in the example below:

When merging file A.vcf.gz containing samples S1, S2 and S3 and file B.vcf.gz containing samples S3 and S4, the output file will contain five samples named S1, S2, S3, 2:S3 and S4.
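
To see how this renaming plays out before running it on 1.3 TB of data, here is a minimal Python sketch of the rule. It is only a simulation of the naming behaviour as I understand it, not bcftools itself: the function name merged_sample_names and the short sample lists are made up for illustration, and the only assumption is that a clashing name gets the 1-based file index plus a colon prepended until it is unique.

```python
def merged_sample_names(files):
    """Simulate sample naming when duplicate names are forced through a merge.

    `files` is a list of sample-name lists, one per input file, in
    command-line order. A name that already exists in the output gets the
    1-based file index and a colon prepended until it is unique.
    (Illustrative helper, not part of bcftools.)
    """
    merged = []
    seen = set()
    for file_index, samples in enumerate(files, start=1):
        for name in samples:
            # Keep prefixing until the name no longer clashes.
            while name in seen:
                name = f"{file_index}:{name}"
            seen.add(name)
            merged.append(name)
    return merged


# The A.vcf.gz / B.vcf.gz example: only S3 clashes.
print(merged_sample_names([["S1", "S2", "S3"], ["S3", "S4"]]))

# Hypothetical per-chromosome case: the same two samples in three files.
print(merged_sample_names([["S1", "S2"]] * 3))
```

The second call suggests what to expect when every file contains the same samples: each sample would appear once per input file (S1, 2:S1, 3:S1 and so on), so 22 per-chromosome files would give 22 columns per person. If your files really do hold the same samples and differ only by chromosome, bcftools concat is the command intended for joining them, and it may be closer to what you want than merge.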
