I am new to bioinformatics and VCF files. I was following some online tutorials to generate
Variant_QC measures such as call_Rate, n_het, etc. for one file under ethnicity
Similarly, I have 100 VCF files under "Chinese" ethnicity.
So now, to generate VARIANT_QC for each variant under the
Chinese ethnicity, I would like to combine the VCF files (all of which belong to the Chinese ethnicity).
As I am new to bioinformatics and bcftools, I have a few questions:
q1) I am not sure which flag I should choose when I merge. Do I have to normalize before merging, as good practice?
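To make q1 concrete, this is what I understand a per-file normalization step might look like before merging. It is only a sketch: the function name `normalize_vcf` and the reference FASTA path are my own placeholders, and I am assuming that left-aligning indels and splitting multiallelic sites is what "normalize" means here.

```shell
#!/usr/bin/env bash
# Sketch: normalize one VCF before merging (names and paths are hypothetical).
# Usage: normalize_vcf <input.vcf.gz> <reference.fa> <output.vcf.gz>
normalize_vcf() {
    local in=$1 ref=$2 out=$3
    # -f: reference FASTA used to left-align and check alleles
    # -m -any: split multiallelic records into biallelic ones
    bcftools norm -f "$ref" -m -any -O z -o "$out" "$in"
    # merge expects indexed inputs, so index the normalized file
    bcftools index --tbi "$out"
}

# Example call (file names are assumptions):
# normalize_vcf sample1_Chinese.vcf.gz reference.fa sample1_Chinese.norm.vcf.gz
```

If this step is unnecessary for my use case, I would be glad to skip it.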
q2) Can a sample ID from one file (e.g. P1234) be repeated in another VCF file? If yes, how can I combine the data from all VCF files without any duplicates? By duplicates, I mean what I know from SQL: if two rows are identical (the values in all columns are the same), one of them is considered a duplicate. I would like to avoid such duplicates in my final file.
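To check whether this is even an issue in my data, I think something like the following could list sample IDs that appear in more than one file. It relies on `bcftools query -l`, which prints the sample names of a VCF one per line; the function name `duplicate_samples` is my own. As far as I understand, `bcftools merge` stops with an error on duplicate sample names unless `--force-samples` is given, so knowing this beforehand seems useful.

```shell
#!/usr/bin/env bash
# Sketch: print sample IDs that occur in more than one of the given VCF files.
# Usage: duplicate_samples file1.vcf.gz file2.vcf.gz ...
duplicate_samples() {
    local f
    for f in "$@"; do
        # one sample name per line, de-duplicated within each file
        bcftools query -l "$f" | sort -u
    done | sort | uniq -d   # keep only names seen in two or more files
}

# Example call (the file pattern is an assumption):
# duplicate_samples *Chinese*.vcf.gz
```

An empty result would mean that every sample ID is unique across the files.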
My file names look like the ones shown below
As shown above, I have 100 files for Chinese ethnicity.
I would like to combine all these 100 files into one single large file. So, I tried the below bcftools command
bcftools merge *Chinese*.vcf.gz -O v -o final.vcf.gz
q3) How can I use a regex to select all VCF files whose names contain Chinese and combine them to produce one single large file?
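For reference, here is a sketch of what I am trying to achieve, written with a plain file-name pattern instead of a true regex. The helper names `list_vcfs` and `merge_vcfs` and the list file name are my own placeholders. I am assuming that every input is bgzip-compressed, that `bcftools merge` needs each input to be indexed, and that `-O z` (compressed output) is the right match for an output name ending in `.vcf.gz`.

```shell
#!/usr/bin/env bash
# Sketch: collect VCFs by name and merge them (helper names are hypothetical).

# Write the paths of all *.vcf.gz files whose name contains a keyword to a list file.
# Usage: list_vcfs <dir> <keyword> <list_file>
list_vcfs() {
    local dir=$1 keyword=$2 list=$3
    # sort so that the file (and therefore sample) order is reproducible
    find "$dir" -maxdepth 1 -type f -name "*${keyword}*.vcf.gz" | sort > "$list"
}

# Merge every file in the list into one compressed, indexed multi-sample VCF.
# Usage: merge_vcfs <list_file> <output.vcf.gz>
merge_vcfs() {
    local list=$1 out=$2 f
    # index any input that has neither a .tbi nor a .csi index yet
    while IFS= read -r f; do
        [ -e "$f.tbi" ] || [ -e "$f.csi" ] || bcftools index --tbi "$f"
    done < "$list"
    # --file-list reads one input path per line
    bcftools merge --file-list "$list" -O z -o "$out"
    bcftools index --tbi "$out"
}

# Example:
# list_vcfs . Chinese chinese_files.txt
# merge_vcfs chinese_files.txt final.vcf.gz
```

Is this a reasonable way to do it, or is there a more idiomatic bcftools approach?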