Hi everybody,
I was wondering whether any quality control steps (and if so, which) should be applied to the 1000 Genomes VCF files when using them alongside our population-specific VCF files. Could you please share the QC steps you apply to these VCF files?
Thanks
The 1000 Genomes public data is already of good quality. How you filter it will depend on your downstream analysis and what is required by it. For simple / general 'housekeeping', take a look at Step 4 from here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
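That step is reproduced below. In it, the two forms of the bcftools annotate ID option behave differently:
-I +'%CHROM:%POS:%REF:%ALT'
sets only unset (missing) IDs to CHROM:POS:REF:ALT, whereas
-x ID -I +'%CHROM:%POS:%REF:%ALT'
first erases the current ID and then sets it to CHROM:POS:REF:ALT.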
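# For each autosome:
#   1. split multi-allelic records and check REF against the reference (warn only)
#   2. erase the existing ID and set it to CHROM:POS:REF:ALT
#   3. remove duplicate records and write compressed BCF, then index it
for chr in {1..22}; do
  bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
    ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
  bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
  bcftools norm -Ob --rm-dup both \
    > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
  bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
done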
Note: you do not have to reset the ID field if you do not wish to do so.
Regarding duplicates: rs ID duplicates exist in the 1000 Genomes data, as the same rs ID can reference multiple positions. I believe there are also genuine duplicates, i.e., records that have the same CHROM:POS:REF:ALT. The above deals with both, but involves re-setting the ID.
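To see how many genuine duplicates a file contains before removing them, something like the following may help. It is only a rough sketch: `dup_keys` is a hypothetical helper name, it works on plain VCF text from stdin, and the CHROM:POS:REF:ALT key is only meaningful once multi-allelic records have been split (as `bcftools norm -m-any` does above).

```shell
# Hypothetical helper (not part of bcftools): list CHROM:POS:REF:ALT keys
# that occur more than once in a VCF text stream read from stdin.
# Header lines (starting with '#') are skipped.
dup_keys() {
  awk -F'\t' '
    /^#/ { next }
    { key = $1 ":" $2 ":" $4 ":" $5; n[key]++ }
    END { for (k in n) if (n[k] > 1) print k "\t" n[k] }
  ' | sort
}
```

Assuming bcftools is installed, it could be fed with `bcftools view in.bcf | dup_keys`, where `in.bcf` is a placeholder for one of your own files. Each output line is a duplicated key followed by the number of records sharing it.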
You can also readily remove multi-allelic calls because they will have the MULTI_ALLELIC tag in the INFO field.
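As a minimal sketch of that filter, the function below drops every record whose INFO column carries the MULTI_ALLELIC key. `drop_multiallelic` is a hypothetical name, and the function operates on uncompressed VCF text rather than on BCF directly.

```shell
# Hypothetical helper: pass header lines through unchanged and drop any
# record whose INFO field (column 8) contains the MULTI_ALLELIC key.
drop_multiallelic() {
  awk -F'\t' '
    /^#/ { print; next }
    # wrap INFO in ";" so the key is matched as a whole token
    (";" $8 ";") !~ /;MULTI_ALLELIC[;=]/ { print }
  '
}
# With bcftools, the same filter can probably be written as
#   bcftools view -e 'INFO/MULTI_ALLELIC=1' in.bcf
# (in.bcf is a placeholder; check the flag syntax against your bcftools version).
```

A possible use, assuming bcftools is available: `bcftools view in.bcf | drop_multiallelic | bcftools view -Ob -o out.bcf`, with both file names as placeholders.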
As far as I am aware, the 1000 Genomes data from Ensembl also only contains variants with PASS in the FILTER field.
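If you would rather verify this on your own copy than take it on trust, a quick tally of the FILTER column is enough. The sketch below assumes plain VCF text on stdin; `filter_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each FILTER value (column 7)
# appears in a VCF text stream read from stdin, skipping header lines.
filter_counts() {
  awk -F'\t' '
    /^#/ { next }
    { n[$7]++ }
    END { for (f in n) print f "\t" n[f] }
  ' | sort
}
```

For example, `bcftools view in.bcf | filter_counts` (with `in.bcf` a placeholder) should print a single PASS line if the file really contains only passing variants.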
Kevin
Many thanks for your nice explanation