bcftools merge - regex pattern and flag to choose?
3.6 years ago
akshaykum684 ▴ 20

Hello Everyone,

I am new to bioinformatics and VCF files. I followed some online tutorials to generate variant QC measures such as call_rate, n_het, etc. for one file from the Chinese ethnicity group.

In total, I have 100 VCF files for the "Chinese" ethnicity.

To generate variant QC for each variant across the Chinese ethnicity, I would like to combine these VCF files (all of which belong to the Chinese ethnicity) into one.

As I am new to bioinformatics and bcftools, I have a few questions:

q1) Which flag should I choose when I merge? Is it good practice to normalize before merging?

q2) Can a sample ID from one file (e.g. P1234) be repeated in another VCF file? If yes, how can I combine the data from all VCF files without any duplicates? By duplicates I mean what I know from SQL: if two rows are identical (the values in all columns are the same), one of them is considered a duplicate. I would like to avoid such duplicates in my final file.

My file names look like this:

ABC_Chinese.1000Gphase3_v5.chr9.dose.vcf.gz

ABC_Chinese.1000Gphase3_v2.chr7.dose.vcf.gz

As shown above, I have 100 such files for the Chinese ethnicity.

I would like to combine all 100 files into one single large file, so I tried the bcftools command below:

bcftools merge *Chinese*.vcf.gz O v -o final.vcf.gz

q3) How can I use a regex to select all VCF files whose names contain Chinese and combine them into one single large file?


The output type option needs a dash: -O

bcftools merge -O z -o final.vcf.gz  *Chinese*.vcf.gz

Hi @Pierre Lindenbaum, thanks, I understand this now.


q3) How can I use a regex to select all VCF files whose names contain Chinese and combine them into one single large file?

ls *.vcf.gz | grep Chinese >  list.txt
bcftools merge -O z -o final.vcf.gz  --file-list list.txt
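
Note that *Chinese*.vcf.gz is a shell glob, not a regular expression. If a true regex is needed (for example, to restrict the list to certain chromosomes), a small helper along these lines could build the list. The function name list_vcfs and the example pattern are illustrative only and are not part of bcftools:

```shell
# List the *.vcf.gz files directly inside a directory whose paths match
# an extended regular expression.
# Usage: list_vcfs DIR REGEX   (hypothetical helper, not a bcftools command)
list_vcfs() {
    local dir="$1" pattern="$2"
    # find prints one path per line; grep -E filters the paths with the regex.
    find "$dir" -maxdepth 1 -type f -name '*.vcf.gz' | grep -E "$pattern" | sort
}

# Usage sketch (requires bcftools for the second command):
# list_vcfs . 'Chinese.*\.chr(7|9)\.' > list.txt
# bcftools merge -O z -o final.vcf.gz --file-list list.txt
```

Because the regex is applied to the whole path, a directory name containing the pattern would also match, so it is worth checking list.txt before running bcftools.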

@Pierre Lindenbaum

I got the error message below:

Error: Duplicate sample names (S1), use --force-samples to proceed anyway.

Though the sample IDs/names might repeat across files, they come from different chromosome positions in different files, and their data could be different as well. Am I right? Why does it still report duplicates? Does it only look at the header/sample names?

What would the output look like with and without the --force-samples flag to bcftools merge?

In the meantime, I modified my command as follows:

bcftools merge --force-samples -O z -o final.vcf.gz *Chinese*.vcf.gz

What happens when the --force-samples flag is used, and what will the output look like? Can you help me with this?


"Merging" VCFs can mean two things: merging single sample VCFs, or concatenating VCFs that cover (ideally non-overlapping) genomic regions. The latter operation is performed using bcftools concat (please read the manual; all of this is explained there).
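
A quick way to see which case applies might be to compare the sample names in the headers, which bcftools query -l prints. The sketch below reports names that occur in more than one file; sample_pairs and dup_samples are hypothetical helper names. If every file lists the same samples (as the "Duplicate sample names (S1)" error suggests), the files are probably per-region pieces of one cohort rather than separate samples. With --force-samples, bcftools merge resolves a clash by prepending the file's index to the conflicting name (e.g. 2:S1), so the same individual would end up as several columns.

```shell
# Print each sample name that occurs in more than one file.
# Input on stdin: one "file<TAB>sample" pair per line.
dup_samples() {
    # sort -u drops repeated pairs, cut keeps the sample column,
    # uniq -d prints only names that appear at least twice.
    sort -u | cut -f2 | sort | uniq -d
}

# Emit "file<TAB>sample" pairs for every VCF named in a list file.
# bcftools query -l prints the sample names from a VCF header.
sample_pairs() {
    local list="$1" f
    while IFS= read -r f; do
        bcftools query -l "$f" | awk -v f="$f" '{ print f "\t" $0 }'
    done < "$list"
}

# Usage sketch (requires bcftools and a list.txt of VCF paths):
# sample_pairs list.txt | dup_samples
```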

You cannot merge and concat in a single go using bcftools. Pick and choose your operations and their sequence carefully.
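
Assuming the 100 files hold the same samples split by chromosome (a guess from the chr7/chr9 file names in the question, not something confirmed in this thread), the sequence could look like the sketch below. order_by_chrom is a hypothetical helper; it keeps only names containing .chrN. with a numeric N, so chrX would need separate handling.

```shell
# Order per-chromosome VCF names by chromosome number (chr1, chr2, ..., chr22).
# Reads file names on stdin; names without ".chrN." (numeric N) are dropped.
order_by_chrom() {
    # Prefix each name with its chromosome number, sort numerically on that
    # prefix, then strip the prefix again.
    awk '{ if (match($0, /\.chr[0-9]+\./))
               print substr($0, RSTART + 4, RLENGTH - 5) "\t" $0 }' \
        | sort -n | cut -f2-
}

# Usage sketch (requires bcftools; concat expects the same samples in every file):
# ls *Chinese*.vcf.gz | order_by_chrom > list.txt
# bcftools concat -O z -o final.vcf.gz --file-list list.txt
# bcftools index -t final.vcf.gz
```

If the files instead hold different samples at the same sites, bcftools merge would be the operation to use, and both steps can be chained by concatenating per cohort first and merging the results afterwards.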
