How To Mask Repeats In NGS Data
12.1 years ago
Daniel ▴ 40

How can I mask repeats in next-generation sequencing data? I have several million NGS reads from a mammalian genome that has not been sequenced yet. I would like to filter out the reads that have a significant hit against the RepBase or RepeatMasker databases. I would appreciate it if anybody could give me more specific instructions.

repeats next-gen • 8.5k views
12.1 years ago
JC 13k

I simply filter reads from repetitive sequences using 2 approaches:

1) Simple repeats and low-complexity sequences can be filtered with DUST, or I just compute the complexity of each sequence using entropy or compression ratio.

2) Interspersed repeats can be filtered by mapping the reads to the RepBase consensus sequences with Bowtie, BWA or Blat (with -fastMap); this step can filter millions of reads in a few minutes.
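The complexity measures in point 1 could be sketched as follows. This is only an illustration, not DUST itself: the function names and the `min_entropy=1.5` threshold are assumptions made for the example and would need tuning on real reads.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mer composition of seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    n = len(kmers)
    counts = Counter(kmers)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def compression_ratio(seq):
    """Compressed size divided by raw size; lower values mean more repetitive."""
    raw = seq.upper().encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def is_low_complexity(seq, min_entropy=1.5, k=1):
    """Flag a read whose k-mer entropy falls below a (hypothetical) threshold."""
    return shannon_entropy(seq, k) < min_entropy
```

Single-base entropy misses longer repeat units (a read made of `ACGT` repeated still scores 2 bits), so a larger `k` or the compression ratio may be a better fit for those. Note that zlib adds a few bytes of fixed overhead, so ratios are only comparable between reads of similar length.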

If you are expecting a lot of repetitive sequences (as in genome sequencing), I strongly suggest filtering before mapping/assembling; otherwise it doesn't give you any advantage.


Could you clarify #2? Do you mean you'd filter reads that map to multiple places?


No, you map the reads to the consensus sequences of known repeats, obtained from RepBase or any other source, and filter out the reads that match.
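Once the reads have been aligned to the repeat consensus sequences, the filtering step might look like the sketch below. It assumes the aligner wrote SAM output and that the reads are in 4-line FASTQ; the function names are made up for this example. The only SAM detail relied on is that FLAG bit 0x4 marks an unmapped read.

```python
from itertools import islice


def mapped_read_ids(sam_lines):
    """Collect names of reads with at least one alignment (FLAG bit 0x4 unset)."""
    ids = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if not int(fields[1]) & 0x4:
            ids.add(fields[0])
    return ids


def fastq_records(lines):
    """Yield (header, sequence, separator, quality) tuples from 4-line FASTQ."""
    it = iter(lines)
    while True:
        rec = [l.rstrip("\n") for l in islice(it, 4)]
        if len(rec) < 4:
            return
        yield tuple(rec)


def drop_repeat_reads(fastq_lines, hit_ids):
    """Yield only the records whose read name is not among the repeat hits."""
    for header, seq, plus, qual in fastq_records(fastq_lines):
        read_id = header[1:].split()[0]
        if read_id not in hit_ids:
            yield header, seq, plus, qual
```

With real files, `sam_lines` and `fastq_lines` can simply be open file handles. Read names in the SAM file must match the FASTQ names exactly for the lookup to work.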


Dear JC,

I have some very basic questions about how to map reads to the Repbase consensi. Could you please give me details on the following?

  • What is a Repbase consensus? Is it distinct for each repeat family? Is it distinct across species?
  • Where can I find it/them for human?
  • Do I build a regular bowtie2 index from this consensus file?

Many thanks,

12.1 years ago

You could either:

  • directly mask your data with RepeatMasker: http://www.repeatmasker.org/
  • map your data against a closely related mammalian genome and intersect the matching positions with the RepeatMasker annotations for that genome
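For the second option, the intersection step could be sketched as below. It assumes both the read alignments and the repeat annotations have already been reduced to `(chrom, start, end)` tuples in 0-based, half-open coordinates (RepeatMasker's own `.out` files use 1-based positions, so they would need converting first). The function names are illustrative only.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(repeats):
    """Merge overlapping repeat intervals and index them per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_repeat(index, chrom, start, end):
    """True if the alignment [start, end) overlaps any annotated repeat."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that begins before the alignment ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start
```

Reads whose alignment returns `True` here would be the ones to discard. For large data sets, a dedicated tool for interval intersection is likely to be faster than pure Python.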
12.1 years ago
Ian 6.0k

Just a thought (I am not sure it is practical or possible): if you could obtain the repetitive sequences from RepBase, you could use them as the reference sequences for an NGS sequence aligner, e.g. Bowtie. Any uniquely mapping reads could then be excluded from your sample.
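One way this idea might be wired up, assuming Bowtie 2 is installed and the repeat sequences are available as a FASTA file, is sketched below. All file names and the helper function names are hypothetical. The sketch relies on Bowtie 2's `--un` option, which writes the unpaired reads that failed to align to a separate file, so those are the reads left after repeat filtering.

```python
import subprocess


def repeat_filter_commands(repeat_fasta, reads_fastq, kept_fastq,
                           index_base="repeat_index",
                           sam_out="repeat_hits.sam", threads=4):
    """Build the bowtie2-build and bowtie2 command lines (nothing is executed)."""
    build = ["bowtie2-build", repeat_fasta, index_base]
    align = [
        "bowtie2",
        "-p", str(threads),
        "-x", index_base,    # index built from the repeat FASTA
        "-U", reads_fastq,   # unpaired input reads
        "--un", kept_fastq,  # reads that did NOT align to any repeat
        "-S", sam_out,       # alignments of the repeat-derived reads
    ]
    return [build, align]


def run_commands(commands):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example (hypothetical file names):
# run_commands(repeat_filter_commands("repeats.fa", "reads.fq", "reads.norepeat.fq"))
```

Default end-to-end alignment is used here; whether a more sensitive or local mode suits reads that only partly overlap a repeat is a judgment call left open.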

10.4 years ago
Biojl ★ 1.7k

A very simple and easy-to-use tool is SEG. It will replace the repeats and/or low-complexity regions (LCRs) in your sequences with 'XXXX'.


This is for protein sequences. It is also for masking mathematical repeats, not the type OP is interested in (which is identifying sequences using a reference library).

7.9 years ago

You can use "tantan", which is used by LAST to mask genomes before comparing them.
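A masker that marks repeats by converting them to lowercase (which, to my understanding, is tantan's default output) does not remove reads by itself. A small post-processing step such as the sketch below could drop reads that are mostly masked. The `max_masked=0.5` cutoff and the function names are assumptions for illustration.

```python
def masked_fraction(seq):
    """Fraction of bases in seq that are lowercase (soft-masked)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def keep_read(seq, max_masked=0.5):
    """Keep a read only if at most max_masked of its bases are masked."""
    return masked_fraction(seq) <= max_masked
```

This only works if the input reads were fully uppercase before masking, so any lowercase base can be attributed to the masker.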
