How to generate a masked vcf file?
9.9 years ago
Daniel E Cook ▴ 280

A recent paper by Heng Li discusses two major filters that can be used to remove artifacts when calling variants. They are:

  • Filtering by excessive depth (currently done using vcfutils -D <large depth>, ~1000)
  • Filtering out low complexity regions - He accomplishes this using mdust and subtracting the regions out using a bedfile.

I am trying to apply these filters to sequencing data from C. elegans and am currently stuck on the second filter. Heng Li provides an LCR file for the human genome on GitHub [here][2]: "LCR-hs38.bed.gz".

I need to find or generate a file like this for C. elegans. Once this BED file is produced, it can be used to subtract LCR variants from a VCF or GFF using bedtools subtract.

As a starting point - a masked version of the C. elegans ce10 genome exists at UCSC in fasta format: http://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZips/chromFaMasked.tar.gz

UCSC uses RepeatMasker to do the masking.

EDIT

For those curious - here is how I generated the necessary file:

wget 'http://hgdownload.soe.ucsc.edu/goldenPath/ce10/database/rmsk.txt.gz' -O LCR_rmsk.txt.gz

gunzip -kfc LCR_rmsk.txt.gz | cut -f 6,7,8 > LCR_rmsk.txt
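For the subtraction step itself, here is a minimal sketch of what the filter does, written in plain awk so it runs without bedtools. The function name mask_vcf and the file names in the usage note are hypothetical, not from the thread. It assumes the three columns cut from rmsk.txt (genoName, genoStart, genoEnd) follow BED conventions (0-based start, half-open end), and that the VCF uses the same chromosome names as UCSC (chrI, chrII, ...). It scans every interval for every variant, so for a whole genome bedtools subtract is the better tool; this only illustrates the logic.

```shell
# Hypothetical helper: drop VCF records whose POS lies inside any BED interval.
# Usage: mask_vcf <intervals.bed> <variants.vcf>
# BED is 0-based and half-open, VCF POS is 1-based,
# so a variant is masked when start < POS <= end.
mask_vcf() {
    awk -F'\t' '
        # First file (BED): store the intervals per chromosome.
        NR == FNR {
            n[$1]++
            s[$1, n[$1]] = $2 + 0
            e[$1, n[$1]] = $3 + 0
            next
        }
        # Second file (VCF): pass header lines through unchanged.
        /^#/ { print; next }
        # Keep a record only if no interval on its chromosome covers POS.
        {
            pos = $2 + 0
            hit = 0
            for (i = 1; i <= n[$1]; i++) {
                if (pos > s[$1, i] && pos <= e[$1, i]) { hit = 1; break }
            }
            if (!hit) print
        }
    ' "$1" "$2"
}
```

Usage would look something like `mask_vcf LCR_rmsk.txt variants.vcf > variants.masked.vcf`. With bedtools, the equivalent should be along the lines of `bedtools subtract -header -a variants.vcf -b LCR_rmsk.txt`; check the options against your installed version.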

Thanks Vivek!

[2] https://github.com/lh3/varcmp/tree/master/scripts


Thanks, I used the following code to take care of this!

9.9 years ago
Vivek ★ 2.7k

You could try aligning the masked sequence to the unmasked reference using BLAT and use the alignment gaps to create a BED file of the N regions.
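If the goal is just a BED file of the N regions, a simpler route than the BLAT alignment (a different technique, named here as an alternative) is to scan the masked FASTA directly for runs of N. The sketch below assumes the masked file is hard-masked, i.e. repeats are replaced by N rather than lowercased; the function name fasta_n_to_bed and the file names in the usage note are hypothetical. It walks the sequence one character at a time in awk, which is slow but fine for a genome the size of C. elegans.

```shell
# Hypothetical helper: emit BED intervals (0-based, half-open) for runs of N in a FASTA file.
# Usage: fasta_n_to_bed <masked.fa>
fasta_n_to_bed() {
    awk '
        # Print the open run, if any, and close it.
        function emit() {
            if (start >= 0) { print chrom "\t" start "\t" pos; start = -1 }
        }
        BEGIN { start = -1 }
        # Header line: finish the previous sequence and reset the coordinate.
        /^>/ { emit(); chrom = substr($1, 2); pos = 0; next }
        # Sequence line: open a run at the first N, close it at the first non-N.
        {
            len = length($0)
            for (i = 1; i <= len; i++) {
                c = substr($0, i, 1)
                if (c == "N" || c == "n") {
                    if (start < 0) start = pos
                } else {
                    emit()
                }
                pos++
            }
        }
        END { emit() }
    ' "$1"
}
```

For example, after unpacking chromFaMasked.tar.gz: `cat chr*.fa.masked > ce10.masked.fa; fasta_n_to_bed ce10.masked.fa > ce10_masked.bed` (the file names inside the archive are an assumption).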

Edit: There's a RepeatMasker regions file at UCSC for C. elegans that you can get the coordinates from:

rmsk.txt.gz
