A recent paper by Heng Li (http://arxiv.org/abs/1404.0929) discusses two major filters that can be used to remove artifacts when calling variants. They are:
- Filtering by excessive depth (currently done with vcfutils.pl varFilter -D <large depth>, with a value around 1000)
- Filtering out low-complexity regions (LCRs). He does this by identifying the regions with mdust and then subtracting them from the call set using a BED file.
I am trying to apply these filters to C. elegans sequencing data and am currently stuck on the second one. Heng Li provides an LCR file for the human genome on GitHub (https://github.com/lh3/varcmp/tree/master/scripts): "LCR-hs38.bed.gz"
I need to find or generate an equivalent file for C. elegans. Once this BED file is produced, it can be used to subtract LCR variants from a VCF or GFF using bedtools subtract.
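As a rough illustration of that subtraction step, here is a minimal sketch. The function name `subtract_lcr` and the file names are placeholders of mine, not from the paper. It assumes a bedtools version whose `subtract` supports `-A` (drop a feature entirely on any overlap) and `-header` (keep the header of the `-a` file).

```shell
# Sketch: drop every variant that overlaps a low-complexity interval.
# $1 = input VCF (or GFF), $2 = LCR BED file; both are placeholder arguments.
# -A removes a feature entirely if it overlaps anything in the BED file;
# -header prints the header of the input file before the results.
subtract_lcr() {
    bedtools subtract -header -A -a "$1" -b "$2"
}
```

It could then be called as `subtract_lcr variants.vcf LCR.bed > variants.noLCR.vcf`. Some people use `bedtools intersect -v` for the same purpose; either way it is worth spot-checking a few removed variants against the BED file.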
As a starting point - a masked version of the C. elegans ce10 genome exists at UCSC in fasta format:
UCSC uses RepeatMasker to do the masking.
For those curious - here is how I generated the necessary file:
wget 'http://hgdownload.soe.ucsc.edu/goldenPath/ce10/database/rmsk.txt.gz' -O LCR_rmsk.txt.gz
gunzip -kfc LCR_rmsk.txt.gz | cut -f 6,7,8 > LCR_rmsk.txt
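One caveat: the rmsk table lists every RepeatMasker annotation (transposons, satellites and so on), so the three-column cut above gives a mask that is broader than low-complexity sequence alone, and it is not equivalent to an mdust-derived file. If only the low-complexity and simple-repeat records are wanted, something like the sketch below might work. It assumes the standard UCSC rmsk column layout (column 12 = repClass) and the class names `Low_complexity` and `Simple_repeat`; the function name is made up for illustration.

```shell
# Sketch: keep only low-complexity and simple-repeat records from a UCSC rmsk
# table (read on stdin) and write sorted 3-column BED (chrom, start, end).
# Assumed rmsk columns: $6 genoName, $7 genoStart, $8 genoEnd, $12 repClass.
rmsk_to_lcr_bed() {
    awk -F '\t' -v OFS='\t' \
        '$12 == "Low_complexity" || $12 == "Simple_repeat" { print $6, $7, $8 }' |
        sort -k1,1 -k2,2n
}
```

Usage would be along the lines of `gunzip -c LCR_rmsk.txt.gz | rmsk_to_lcr_bed > LCR_only.bed`. It is worth checking the distinct values in column 12 first (`cut -f 12 | sort | uniq -c`) to confirm the class names in this particular table.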