Hello Everyone,
I am new to bioinformatics and VCF files. I was following some tutorials online to generate Variant_QC
measures such as call_Rate, n_het, etc. for one file under the ethnicity "Chinese".
Similarly, I have 100 VCF files under the "Chinese" ethnicity.
So now, to generate VARIANT_QC for each variant under the Chinese
ethnicity, I would like to combine these VCF files (all of which belong to the Chinese ethnicity).
As I am new to bioinformatics and bcftools, I have a few questions:
q1) I am not sure which flag I should choose when I merge. Do I have to normalize before merging, as good practice?
q2) Can a sample ID from one file (e.g. P1234) be repeated in another VCF file? If yes, how can I combine the data from all VCF files without any duplicates? By duplicates I mean, based on my experience with SQL, two identical rows (the values in all columns are the same), where one of them is considered a duplicate. I would like to avoid such duplicates in my final file.
My file names look like this:
ABC_Chinese.1000Gphase3_v5.chr9.dose.vcf.gz
ABC_Chinese.1000Gphase3_v2.chr7.dose.vcf.gz
As shown above, I have 100 files for Chinese ethnicity.
I would like to combine all these 100 files into one single large file. So, I tried the below bcftools command
bcftools merge *Chinese*.vcf.gz O v -o final.vcf.gz
q3) How can I use a regex to select all VCF files whose names contain "Chinese" and combine them to produce one single large file?
The option is -O (with a leading dash), not O.
Hi @Pierre Lindenbaum, thanks, I understood this.
@Pierre Lindenbaum
I got the below error message
Though the sample IDs/names might repeat across files, they come from different chromosome positions/different files, and their data could be different as well. Am I right? Why does it still throw duplicate error messages? Does it only look at the header/sample names?
May I know what the output will look like with and without the --force-samples flag to bcftools merge?
However, I modified my command to look like below
May I know what will happen when using the --force-samples flag and what the output will look like? Can you help me with this?
"Merging" VCFs can mean two things: merging single-sample VCFs, or concatenating VCFs that cover (ideally non-overlapping) genomic regions. The latter operation is performed using bcftools concat.
(Please read the manual - all of this is explained there.)
You cannot merge and concat in a single go using bcftools. Pick and choose your operations and their sequence carefully.
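For the files in the question (one file per chromosome, presumably with the same samples in each), a minimal sketch of the concat route might look like the following. The helper names, the final.vcf.gz output name and the use of GNU sort -V for chromosome ordering are assumptions made for illustration, not something taken from the thread. Plain shell globbing (*Chinese*) stands in for the regex asked about in q3.

```shell
#!/usr/bin/env bash
# Sketch only: assumes every input file holds the SAME samples in the SAME
# order, and that each file covers a different chromosome.

# Hypothetical helper: print the matching files, one per line, in natural
# order (chr1, chr2, ..., chr10). Relies on `sort -V` (GNU coreutils).
list_chinese_vcfs() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -name '*Chinese*.vcf.gz' | sort -V
}

# Hypothetical helper: concatenate the listed files into one bgzipped VCF.
concat_chinese_vcfs() {
    local dir="${1:-.}"
    local out="${2:-final.vcf.gz}"
    local list status
    list="$(mktemp)"
    list_chinese_vcfs "$dir" > "$list"
    if [ ! -s "$list" ]; then
        echo "no *Chinese*.vcf.gz files found in $dir" >&2
        rm -f "$list"
        return 1
    fi
    # -f: read input file names from a file; -O z: write compressed VCF
    bcftools concat -f "$list" -O z -o "$out"
    status=$?
    rm -f "$list"
    return "$status"
}

# Usage (not run here): concat_chinese_vcfs /path/to/vcfs final.vcf.gz
```

Before concatenating, bcftools query -l FILE prints the sample names of a file, which is a quick way to check that all files really share the same samples. If the files instead hold different samples over the same sites, bcftools merge is the matching operation. As far as I read the bcftools manual, --force-samples lets merge proceed when sample names collide and resolves the duplicates by prefixing the conflicting names with the index of the file they came from, so the same ID ends up as two separate sample columns rather than being combined.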