Question: How to generate a masked vcf file?
3
gravatar for Daniel E Cook
6.4 years ago by
Daniel E Cook240
Chicago
Daniel E Cook240 wrote:

A recent paper by Heng Li (http://arxiv.org/abs/1404.0929) discusses two major filters that can be used to remove artifacts when calling variants. They are:

  • Filtering by excessive depth (currently accomplished using vcfutils -D <large depth>, ~1000)
  • Filtering out low-complexity regions (LCRs). He accomplishes this using mdust and subtracting the regions out using a BED file.
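To make the first filter concrete, here is a minimal awk sketch of a maximum-depth filter. It is not the vcfutils implementation; it assumes the total depth is stored as a DP= entry in the INFO column (column 8) of the VCF, and the function name filter_max_depth is made up for illustration. Records without a DP entry are passed through unchanged.

```shell
# Hypothetical helper: keep VCF records whose INFO/DP is at most a given maximum.
# Assumes depth is recorded as DP=<int> in the INFO column (column 8).
filter_max_depth() {
    # $1 = VCF path, $2 = maximum allowed depth
    awk -F'\t' -v max="$2" '
    /^#/ { print; next }                      # pass header lines through
    {
        dp = -1
        n = split($8, info, ";")
        for (i = 1; i <= n; i++)
            if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
        # keep records with no DP entry, or with DP at or below the cutoff
        if (dp < 0 || dp <= max + 0) print
    }' "$1"
}
```

For example, filter_max_depth calls.vcf 1000 > calls.maxdp.vcf (file names are placeholders) would drop sites deeper than 1000 reads.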

So - I am trying to apply these filters to sequencing done in C. elegans and am currently stuck on the second filter. Heng Li provides an LCR file for the human genome on GitHub here (https://github.com/lh3/varcmp/tree/master/scripts), "LCR-hs38.bed.gz"

I need to find or generate a file like this for C. elegans. Once this BED file is produced, it can be used to subtract LCR variants from a VCF or GFF using bedtools subtract.
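With bedtools installed, the subtraction step would typically be something like bedtools subtract -header -a variants.vcf -b LCR.bed (file names are placeholders). To show what that step does, here is a rough awk-only sketch under simplifying assumptions: it tests only the variant's start position against each BED interval, it scans intervals linearly (slow on large files), and it needs a non-empty BED file. The function name subtract_bed_from_vcf is made up for illustration.

```shell
# Hypothetical helper: drop VCF records whose POS falls inside any BED interval.
# Illustration only; bedtools subtract is the practical tool for real data.
subtract_bed_from_vcf() {
    # $1 = BED of regions to exclude (must be non-empty), $2 = VCF
    awk -F'\t' '
    NR == FNR {                               # first file: load BED intervals
        if ($0 ~ /^(#|track|browser)/ || NF < 3) next
        n[$1]++
        s[$1, n[$1]] = $2 + 0
        e[$1, n[$1]] = $3 + 0
        next
    }
    /^#/ { print; next }                      # second file: VCF header lines
    {
        p = $2 - 1                            # VCF POS is 1-based, BED is 0-based half-open
        hit = 0
        for (i = 1; i <= n[$1]; i++)
            if (p >= s[$1, i] && p < e[$1, i]) { hit = 1; break }
        if (!hit) print
    }' "$1" "$2"
}
```

Note the coordinate shift: a variant at VCF position 100 corresponds to the BED interval 99-100, which is why POS - 1 is compared against the BED start and end.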

As a starting point - a masked version of the C. elegans ce10 genome exists at UCSC in fasta format: 

http://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZips/chromFaMasked.tar.gz

UCSC uses RepeatMasker to do the masking.
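One way to turn a masked FASTA into BED intervals is to scan it for runs of N. The sketch below assumes the masked files are hard-masked (repeats replaced by N) and that sequence lines contain no stray whitespace; the function name fasta_n_runs_to_bed is made up for illustration. Runs that continue across line breaks are merged into one interval.

```shell
# Hypothetical helper: report runs of N/n in a FASTA file as 0-based, half-open BED intervals.
fasta_n_runs_to_bed() {
    # $1 = FASTA path
    awk '
    function emit() { if (start >= 0) print chrom "\t" start "\t" end; start = -1 }
    BEGIN { start = -1 }
    /^>/ { emit(); chrom = substr($1, 2); pos = 0; next }   # new sequence record
    {
        line = $0; off = 0
        while (match(line, /[Nn]+/)) {
            s = pos + off + RSTART - 1        # 0-based start of this run
            e = s + RLENGTH                   # exclusive end of this run
            if (start >= 0 && s == end) end = e            # run continues from previous line
            else { emit(); start = s; end = e }
            off += RSTART + RLENGTH - 1
            line = substr(line, RSTART + RLENGTH)
        }
        pos += length($0)
    }
    END { emit() }
    ' "$1"
}
```

For example, fasta_n_runs_to_bed chrI.fa > chrI.masked.bed (file names are placeholders) writes one line per masked stretch.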

EDIT

For those curious - here is how I generated the necessary file:

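# Download the UCSC RepeatMasker table for ce10
wget 'http://hgdownload.soe.ucsc.edu/goldenPath/ce10/database/rmsk.txt.gz' -O LCR_rmsk.txt.gz
# Keep columns 6-8 (chromosome, start, end), which gives a BED-style file
gunzip -c LCR_rmsk.txt.gz | cut -f 6,7,8 > LCR_rmsk.txt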
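Note that the commands above keep every RepeatMasker annotation, not just low-complexity sequence. If only LCR-like intervals are wanted, a possible refinement is to filter on the repeat class first. This sketch assumes the usual UCSC rmsk table layout, with the repeat class in column 12 and the class names Low_complexity and Simple_repeat; check both against the table schema for your assembly. The function name rmsk_lcr_to_bed is made up for illustration.

```shell
# Hypothetical helper: extract only low-complexity / simple-repeat intervals from rmsk.txt.gz.
# Assumes column 12 holds the repeat class and columns 6-8 hold chrom, start, end.
rmsk_lcr_to_bed() {
    # $1 = path to rmsk.txt.gz
    gunzip -c "$1" | awk -F'\t' '
        $12 == "Low_complexity" || $12 == "Simple_repeat" {
            print $6 "\t" $7 "\t" $8
        }'
}
```

For example, rmsk_lcr_to_bed LCR_rmsk.txt.gz > LCR_only.bed (output name is a placeholder).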

Thanks Vivek!

Tags: fasta, variant calling, vcf
Daniel E Cook (Chicago) wrote, 6.2 years ago:

Thanks, I used the following code to take care of this!

Vivek (Denmark) wrote, 6.4 years ago:

You could try aligning the masked sequence to the unmasked reference using BLAT and use the alignment gaps to create a BED file of the N regions.

Edit: There's a RepeatMasker regions file at UCSC for C. elegans that you can get the coordinates from:

rmsk.txt.gz
