Question

bcftools merge - regex pattern and flag to choose?

0

3.6 years ago

akshaykum684 ▴ 20

Hello Everyone,

I am new to bioinformatics and VCF files. I was following some online tutorials to generate Variant_QC measures such as call_Rate, n_het, etc. for one file under the ethnicity Chinese.

Similarly, I have 100 VCF files under "Chinese" ethnicity.

To generate VARIANT_QC for each variant under Chinese ethnicity, I would like to combine these VCF files (all belonging to Chinese ethnicity) first.

As I am new to bioinformatics and bcftools, I have a few questions:

q1) I am not sure which flag I should choose when I merge. Do I have to normalize before merging, as good practice?

q2) Can a sample ID from one file (e.g. P1234) be repeated in another VCF file? If so, how can I combine the data from all VCF files without any duplicates? By duplicates I mean what I know from SQL: if two rows are identical (the values in all columns are the same), one of them is considered a duplicate. I would like to avoid such duplicates in my final file.

My file names look like this:

ABC_Chinese.1000Gphase3_v5.chr9.dose.vcf.gz

ABC_Chinese.1000Gphase3_v2.chr7.dose.vcf.gz

As shown above, I have 100 files for Chinese ethnicity.

I would like to combine all 100 files into one single large file, so I tried the bcftools command below:

bcftools merge *Chinese*.vcf.gz O v -o final.vcf.gz

q3) How can I use a regex to select all VCF files whose names contain Chinese and combine them into one single large file?

sequence sequencing Assembly SNP genome • 1.3k views

ADD COMMENT • link 3.6 years ago by akshaykum684 ▴ 20

1

The option is -O, with a dash (and z for compressed output, to match the .vcf.gz file name):

bcftools merge -O z -o final.vcf.gz  *Chinese*.vcf.gz

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 161k

0

Hi @Pierre Lindenbaum - Thanks, I understand this now.

ADD REPLY • link 3.6 years ago by akshaykum684 ▴ 20

1

q3) How can I use a regex to select all VCF files whose names contain Chinese and combine them into one single large file?

ls *.vcf.gz | grep Chinese >  list.txt
bcftools merge -O z -o final.vcf.gz  --file-list list.txt
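If you want an actual regular expression rather than a shell glob, something like the following might work. It is only a sketch: `list_vcfs` is a hypothetical helper name (not part of bcftools), and the pattern `_Chinese\.` is an assumption based on the file names shown in the question.

```shell
# list_vcfs DIR REGEX  (hypothetical helper, not a bcftools command)
# Prints every *.vcf.gz in DIR whose file name matches the extended regex.
list_vcfs() {
    for f in "$1"/*.vcf.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        # Match against the file name only, not the directory part.
        if printf '%s\n' "${f##*/}" | grep -Eq "$2"; then
            printf '%s\n' "$f"
        fi
    done
}

# Intended use (not run here):
#   list_vcfs . '_Chinese\.' > list.txt
#   bcftools merge -O z -o final.vcf.gz --file-list list.txt
```

Compared with `ls | grep`, the regex is applied to the file name only, so a directory that happens to contain "Chinese" in its name does not pull in unrelated files.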

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 161k

0

@Pierre Lindenbaum

I got the error message below:

Error: Duplicate sample names (S1), use --force-samples to proceed anyway.

Though the sample IDs/names might repeat across files, they come from different chromosome positions/different files, and their data could be different as well. Am I right? Why does it still throw a duplicate error? Does it only look at the header/sample names?

May I know what the output looks like with and without the --force-samples flag to bcftools merge?

However, I modified my command as below:

bcftools merge --force-samples -O z -o final.vcf.gz *Chinese*.vcf.gz

What happens when the --force-samples flag is used, and what will the output look like? Can you help me with this?

ADD REPLY • link 3.6 years ago by akshaykum684 ▴ 20

0

"Merging" VCFs can mean two things: merging VCFs that hold different samples (for example, single-sample VCFs), or concatenating VCFs that hold the same samples but cover different (ideally non-overlapping) genomic regions. The latter operation is performed using bcftools concat. Please read the manual; all of this is explained there.
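To see which case applies, it may help to check whether the same sample IDs occur in more than one file. The "Duplicate sample names" error is raised from the sample columns in the VCF headers, not from the variant records. A minimal sketch, where `shared_samples` is a hypothetical helper that reads one sample ID per line:

```shell
# shared_samples (hypothetical helper): reads sample IDs on stdin, one per line,
# and prints each ID that occurs more than once.
shared_samples() {
    sort | uniq -d
}

# Intended use (not run here): bcftools query -l prints the sample IDs of one VCF.
#   for f in *Chinese*.vcf.gz; do bcftools query -l "$f"; done | shared_samples
```

If this prints IDs such as S1 for files that differ only by chromosome, the files most likely hold the same individuals and should be concatenated rather than merged. As far as I know, --force-samples does not remove anything: it keeps the duplicate columns and renames them by prefixing the index of the input file (for example 2:S1), so the same individual ends up as several samples.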

You cannot merge and concat in a single go using bcftools. Pick and choose your operations and their sequence carefully.
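For per-chromosome files like the ones in the question, and assuming every file contains the same samples in the same order (bcftools concat requires this), the concatenation step could look roughly like this. `sort_by_chrom` and `concat_chroms` are hypothetical helper names, the `.chrN.` naming is taken from the question, and only numeric chromosomes are handled.

```shell
# sort_by_chrom (hypothetical helper): reads file names on stdin and prints
# them ordered by the number in ".chrN." (assumes numeric chromosomes only).
sort_by_chrom() {
    awk '{ n = $0; sub(/.*\.chr/, "", n); sub(/\..*/, "", n); print n "\t" $0 }' |
        sort -n | cut -f2-
}

# concat_chroms OUT FILE... (hypothetical helper): concatenates per-chromosome
# VCFs into OUT. Requires bcftools; all inputs must share the same samples.
concat_chroms() {
    out=$1; shift
    printf '%s\n' "$@" | sort_by_chrom > chroms.txt
    bcftools concat --file-list chroms.txt -O z -o "$out"
}

# Intended use (not run here):
#   concat_chroms Chinese.all_chr.vcf.gz ABC_Chinese.*.chr*.dose.vcf.gz
```

The explicit sort is there because a plain glob lists chr10 before chr2. If you also have files with different samples, concatenate each sample set first and run bcftools merge on the results as a second step.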

ADD REPLY • link 3.6 years ago by Ram 43k