Question: bcftools merge - regex pattern and flag to choose?
Asked 12 days ago by akshaykum684:

Hello Everyone,

I am new to bioinformatics and VCF files. I have been following some online tutorials to generate Variant_QC measures such as call_Rate, n_het, etc. for one file under the ethnicity "Chinese".

Similarly, I have 100 VCF files under "Chinese" ethnicity.

So, to generate VARIANT_QC for each variant under the Chinese ethnicity, I would like to combine these VCF files (all belonging to the Chinese ethnicity).

As I am new to bioinformatics and bcftools,

q1) I am not sure which flag I should choose when I merge. Do I have to normalize before merging, as good practice?

q2) Can a sample ID from one file (e.g. P1234) be repeated in another VCF file? If so, how can I combine the data from all VCF files without duplicates? By duplicates I mean what I know from SQL: if two rows are identical (the values in all columns are the same), one of them is considered a duplicate. I would like to avoid such duplicates in my final file.

My file names look like this:

ABC_Chinese.1000Gphase3_v5.chr9.dose.vcf.gz

ABC_Chinese.1000Gphase3_v2.chr7.dose.vcf.gz

As shown above, I have 100 files for Chinese ethnicity.

I would like to combine all 100 files into one single large file, so I tried the bcftools command below:

bcftools merge *Chinese*.vcf.gz O v -o final.vcf.gz

q3) How can I use a regex to select all VCF files whose names contain "Chinese" and combine them into one single large file?

modified 10 days ago • written 12 days ago by akshaykum684

The option is -O (with a dash):

bcftools merge -O z -o final.vcf.gz  *Chinese*.vcf.gz
written 12 days ago by Pierre Lindenbaum
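
The letter after -O sets the output format: z is compressed VCF, v is uncompressed VCF, b is compressed BCF. With `-O v -o final.vcf.gz` the file would be plain text despite its .gz name. A minimal sketch of keeping the two in sync follows; `output_type_for` is a hypothetical helper for illustration, not part of bcftools.

```shell
#!/bin/sh
# output_type_for (hypothetical helper): print the bcftools -O letter
# that matches the extension of the output file name given as $1.
output_type_for() {
    case "$1" in
        *.vcf.gz) echo z ;;   # compressed VCF
        *.vcf)    echo v ;;   # uncompressed VCF
        *.bcf)    echo b ;;   # compressed BCF
        *)        echo "unrecognised extension: $1" >&2; return 1 ;;
    esac
}

# Usage (assumes bcftools is installed and every input is bgzipped and indexed):
#   out=final.vcf.gz
#   bcftools merge -O "$(output_type_for "$out")" -o "$out" *Chinese*.vcf.gz
```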

Hi @Pierre Lindenbaum - Thanks, I understood this.

modified 12 days ago • written 12 days ago by akshaykum684

q3) How can I use a regex to select all VCF files whose names contain "Chinese" and combine them into one single large file?

ls *.vcf.gz | grep Chinese >  list.txt
bcftools merge -O z -o final.vcf.gz  --file-list list.txt
written 12 days ago by Pierre Lindenbaum
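
Note that `*Chinese*.vcf.gz` is a shell glob, not a regex; the shell expands it before bcftools starts. If a real regex is wanted, one option is to filter the file names with `grep -E` while building the list. The sketch below assumes that approach; `write_vcf_list` and the example pattern are hypothetical, chosen only to fit the file names shown in the question.

```shell
#!/bin/sh
# write_vcf_list (hypothetical helper): write every .vcf.gz file in directory $1
# whose path matches the extended regex $2 to the list file $3, one per line.
# The regex is applied to the path as printed, directory prefix included.
write_vcf_list() {
    dir=$1; pattern=$2; list=$3
    for f in "$dir"/*.vcf.gz; do
        [ -e "$f" ] || continue      # skip the unexpanded glob when nothing matches
        printf '%s\n' "$f"
    done | grep -E "$pattern" > "$list"
}

# Usage (assumes bcftools is installed and each listed file is indexed):
#   write_vcf_list . 'Chinese.*\.chr[0-9]+\.dose\.vcf\.gz$' list.txt
#   bcftools merge -O z -o final.vcf.gz --file-list list.txt
```

Looping over the glob instead of parsing `ls` output keeps file names with unusual characters intact.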

@Pierre Lindenbaum

I got the below error message

Error: Duplicate sample names (S1), use --force-samples to proceed anyway.

Though the sample IDs/names might repeat across files, they come from different chromosomes/different files, and their data could differ as well. Am I right? Why does it still throw a duplicate-sample error? Does it only look at the header/sample names?

May I know what the output will look like with and without the --force-samples flag to bcftools merge?

However, I modified my command to look like this:

bcftools merge --force-samples -O z -o final.vcf.gz Chinese.vcf.gz

May I know what happens when the --force-samples flag is used and what the output will look like? Can you help me with this?

modified 12 days ago • written 12 days ago by akshaykum684
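
On the duplicate-sample error: bcftools merge combines files column-wise, one genotype column per sample, so it compares the sample names in the headers regardless of which chromosome each file covers. With --force-samples, the bcftools manual describes the conflicting names as being prefixed with the index of the file they came from (for example `2:S1`), so the same individual ends up in several separate columns rather than being combined. Before deciding, it may help to see which names are shared. A minimal sketch, assuming the sample lists were dumped with `bcftools query -l`; `shared_samples` and the file names are hypothetical.

```shell
#!/bin/sh
# shared_samples (hypothetical helper): given two text files holding one
# sample name per line, print the names that occur in both.
shared_samples() {
    # -F: fixed strings, -x: whole-line match, -f: read the patterns from file $1
    grep -Fx -f "$1" "$2" | sort -u
}

# Usage (assumes bcftools is installed; a.vcf.gz and b.vcf.gz are placeholders):
#   bcftools query -l a.vcf.gz > a.samples
#   bcftools query -l b.vcf.gz > b.samples
#   shared_samples a.samples b.samples
```

If every file lists the same samples and the files differ only by chromosome, the files are probably candidates for bcftools concat rather than merge.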

"Merging" VCFs can mean two things: merging single-sample VCFs, or concatenating VCFs that cover (ideally non-overlapping) genomic regions. The latter operation is performed using bcftools concat. (Please read the manual; all of this is explained there.)

You cannot merge and concat in a single go using bcftools. Pick and choose your operations and their sequence carefully.

written 12 days ago by RamRS
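
As a rough sketch of that sequencing: per-chromosome files that share the same samples would first be concatenated, and only the resulting per-cohort files merged. bcftools concat expects its inputs in genomic order (unless --allow-overlaps is used), and a plain glob puts chr10 before chr2, so the helper below sorts names by chromosome number. `chr_sorted` and all file names in the usage lines are hypothetical.

```shell
#!/bin/sh
# chr_sorted (hypothetical helper): read file names on stdin and print them
# ordered by the number that follows ".chr" (chr2 before chr10).
chr_sorted() {
    # Prefix each name with its chromosome number, sort numerically, drop the prefix.
    sed -E 's/^(.*\.chr([0-9]+)\..*)$/\2 \1/' | sort -n -k1,1 | cut -d' ' -f2-
}

# Usage (assumes bcftools is installed and all inputs are bgzipped and indexed):
#   printf '%s\n' ABC_Chinese.*.chr*.dose.vcf.gz | chr_sorted > chr_list.txt
#   bcftools concat -O z -o ABC_Chinese.all_chr.vcf.gz --file-list chr_list.txt
#   bcftools index -t ABC_Chinese.all_chr.vcf.gz
#   # Merge only files that hold different samples:
#   bcftools merge -O z -o final.vcf.gz ABC_Chinese.all_chr.vcf.gz OTHER_Chinese.all_chr.vcf.gz
```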