Using repeat region filter post QC
5.1 years ago
zx8754 11k

We have NGS whole-exome data for a study of 2000 cases and 2000 controls. Sequencing and QC were done by a third party, so in theory we already have "clean" data.

We are planning to do gene-based tests (SKAT, burden, etc., or any other analysis). Would you advise additionally removing any (or all?) of the repeat regions below (files are available at UCSC goldenPath):

  • simpleRepeat.txt.gz
  • rmsk.txt.gz
  • genomicSuperDups.txt.gz

We have settled on using simpleRepeats, but have no real rationale for it. Is it common practice to remove repeat regions and, if so, which ones?

It was also suggested that we use "blacklists"; which ones should we use?

5.1 years ago

In the VCF I don't remove anything, but I flag the variants in the FILTER column.

I use the following resources to flag the variants: (mappability, ConsensusExcludable...)

the BED file provided in this paper by Heng Li:

the GATK annotation HomopolymerRun

there is also: (Blacklisted genomic regions for functional genomics analysis)
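
The "flag, don't remove" idea described in this answer could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the answerer's actual pipeline: the function names (load_bed, overlaps, flag_vcf) and the FILTER ID "RepeatRegion" are hypothetical, and it assumes an uncompressed, tab-delimited VCF whose chromosome names match those in the BED (UCSC files use the "chr" prefix).

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-chromosome sorted, merged (start, end) intervals."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, ivs in raw.items():
        ivs.sort()
        out = [ivs[0]]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:  # overlapping or adjacent: extend the last interval
                out[-1] = (out[-1][0], max(out[-1][1], e))
            else:
                out.append((s, e))
        merged[chrom] = out
    return merged


def overlaps(intervals, chrom, pos, ref):
    """True if the REF span of a variant overlaps any BED interval.

    VCF POS is 1-based; BED intervals are 0-based, half-open.
    """
    ivs = intervals.get(chrom, [])
    vstart, vend = pos - 1, pos - 1 + len(ref)
    # Index of the last interval whose start is < vend.
    i = bisect.bisect_left(ivs, (vend,)) - 1
    return i >= 0 and ivs[i][1] > vstart


def flag_vcf(vcf_lines, intervals, filter_id="RepeatRegion"):
    """Yield VCF lines with filter_id added to FILTER for overlapping variants.

    No record is dropped; only the FILTER column (7th field) is changed.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
            continue
        if line.startswith("#CHROM"):
            # Declare the new filter just before the column header line.
            yield (f'##FILTER=<ID={filter_id},'
                   f'Description="Overlaps a repeat/blacklist BED interval">')
            yield line
            continue
        fields = line.split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        if overlaps(intervals, chrom, pos, ref):
            current = fields[6]
            fields[6] = (filter_id if current in (".", "PASS")
                         else f"{current};{filter_id}")
        yield "\t".join(fields)
```

With real files, load_bed could be fed an open BED file handle and flag_vcf an open VCF handle, writing each yielded line to a new file. Because nothing is deleted, the decision about which flags to exclude can be deferred to analysis time, for example by keeping only PASS records with something like bcftools view -f PASS.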


