Quality control on 1000 Genomes VCF files
5.2 years ago
seta ★ 1.9k

Hi everybody,

I was wondering whether (and which) quality control steps should be applied to the 1000 Genomes VCF files when using them alongside our own population-specific VCF files. Could you please let me know which QC steps you run on these VCF files?

Thanks

5.2 years ago

The 1000 Genomes public data is already of good quality. How you filter it will depend on your downstream analysis and what is required by it. For simple / general 'housekeeping', take a look at Step 4 from here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

That step does the following:

  • Splits multi-allelic calls and left-aligns indels against the reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT
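To make the difference between the two forms concrete, here is a minimal sketch that mimics them with awk on a made-up two-record table (columns CHROM, POS, ID, REF, ALT). It is not bcftools itself; the function names and the toy records are my own assumptions, purely for illustration.

```shell
# Toy tab-separated records: CHROM POS ID REF ALT (values are made up).
toy_vcf=$(printf '1\t100\trs1\tA\tG\n1\t200\t.\tC\tT\n')

# Mimics -I +'%CHROM:%POS:%REF:%ALT': only a missing ID (".") is filled in.
set_missing_ids() {
    awk 'BEGIN { FS = OFS = "\t" } $3 == "." { $3 = $1 ":" $2 ":" $4 ":" $5 } { print }'
}

# Mimics -x ID -I +'%CHROM:%POS:%REF:%ALT': every ID is discarded and rebuilt.
reset_all_ids() {
    awk 'BEGIN { FS = OFS = "\t" } { $3 = $1 ":" $2 ":" $4 ":" $5; print }'
}

printf '%s\n' "$toy_vcf" | set_missing_ids
printf '%s\n' "$toy_vcf" | reset_all_ids
```

With the first function rs1 survives and only the "." record gets 1:200:C:T; with the second, rs1 is replaced by 1:100:A:G as well.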

for chr in {1..22}; do
    # 1st pipe: split multi-allelic calls, left-align indels, warn on REF mismatches
    # 2nd pipe: erase the existing ID and set it to CHROM:POS:REF:ALT
    # 3rd pipe: remove duplicate records and write compressed BCF
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
        ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz |
    bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |
    bcftools norm -Ob --rm-dup both \
        > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf
done

Note: you do not have to reset the ID field if you do not wish to do so.

Regarding duplicates: duplicate rs IDs exist in the 1000 Genomes data, because the same rs ID can reference multiple positions. I believe there are also genuine duplicates, i.e., records that have the same CHROM:POS:REF:ALT. The above deals with both, but involves resetting the ID.
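If you want to see how many duplicates of each kind you have before removing them, something along these lines should work. This is only a sketch on made-up records (columns CHROM, POS, ID, REF, ALT); the function names are hypothetical.

```shell
# Toy tab-separated records: CHROM POS ID REF ALT (values are made up).
# rs10 is reused at two positions; the site 1:300:G:A appears twice.
toy_vcf=$(printf '1\t100\trs10\tA\tG\n1\t200\trs10\tC\tT\n1\t300\trs20\tG\tA\n1\t300\trs21\tG\tA\n')

# IDs attached to more than one record.
dup_ids() { cut -f3 | sort | uniq -d; }

# CHROM:POS:REF:ALT keys that occur more than once.
dup_sites() { awk -F'\t' '{ print $1 ":" $2 ":" $4 ":" $5 }' | sort | uniq -d; }

printf '%s\n' "$toy_vcf" | dup_ids
printf '%s\n' "$toy_vcf" | dup_sites
```

On a real file I would expect the same sort | uniq -d idea to work on the output of bcftools query, e.g. with -f '%ID\n' or -f '%CHROM:%POS:%REF:%ALT\n'. Be aware that a missing ID (".") will show up as a "duplicate" ID if several records lack one.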

You can also readily remove multi-allelic calls because they will have the MULTI_ALLELIC tag in the INFO field.
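As a rough illustration of filtering on that tag, the sketch below drops every record whose INFO column carries MULTI_ALLELIC. It uses awk on made-up 8-column records, and the function name is hypothetical. With bcftools itself, I believe an exclusion expression such as bcftools view -e 'INFO/MULTI_ALLELIC=1' achieves the same thing, but please check that against your bcftools version.

```shell
# Toy records: CHROM POS ID REF ALT QUAL FILTER INFO (values are made up).
toy_vcf=$(printf '1\t100\trs1\tA\tG\t100\tPASS\tAC=5;AN=10\n1\t200\trs2\tC\tT\t100\tPASS\tAC=1;AN=10;MULTI_ALLELIC\n1\t200\trs2\tC\tG\t100\tPASS\tAC=2;AN=10;MULTI_ALLELIC\n')

# Keep header lines; drop records whose INFO contains the MULTI_ALLELIC flag.
drop_multiallelic() {
    awk -F'\t' '
        /^#/ { print; next }
        {
            n = split($8, tags, ";")
            for (i = 1; i <= n; i++)
                if (tags[i] == "MULTI_ALLELIC") next
            print
        }'
}

printf '%s\n' "$toy_vcf" | drop_multiallelic
```

Matching the tag as a whole semicolon-separated field, rather than with a plain substring search, avoids accidentally hitting other INFO keys that merely contain the same text.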

As far as I am aware, the 1000 Genomes data from Ensembl also only contains variants with PASS in the FILTER field.
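If you would rather verify that than take my word for it, listing the distinct FILTER values is quick. A minimal sketch on a made-up VCF snippet (the function name is hypothetical):

```shell
# Toy VCF text with header lines (values are made up).
toy_vcf=$(printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\trs1\tA\tG\t100\tPASS\t.\n1\t200\trs2\tC\tT\t100\tPASS\t.\n')

# Distinct values of the FILTER column (column 7), ignoring header lines.
filter_values() { grep -v '^#' | cut -f7 | sort -u; }

printf '%s\n' "$toy_vcf" | filter_values
```

For a real file, piping bcftools view -H (which omits the header) into cut -f7 | sort | uniq -c should give the same information together with counts. If only PASS comes back, there is nothing to filter on that field.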

Kevin

Many thanks for your nice explanation
