Where Can I Find A Good Database Of The Repeat Regions Of The Human Genome?
5
7
11.3 years ago
Stephwen ▴ 160

Hello everyone,

I'm currently using GASV (http://code.google.com/p/gasv/) to find structural variants in human whole genome data.

I want to exclude variants that I consider irrelevant, namely those located in highly repetitive regions such as centromeres and telomeres, as well as in other repeat regions of the genome.

Therefore, I'm looking for a database of such repeat regions.

repeats database human genome • 13k views

6
11.3 years ago
Eric Fournier ★ 1.4k

If your data is in genomic coordinates, you could use the Table Browser tool of the UCSC Genome Browser to extract repeat element annotations from the RepeatMasker track.

If you have sequences, you could use RepeatMasker and RepBase to determine which parts of your sequences are repetitive in nature.

5
11.3 years ago
brentp 24k

UCSC has a simpleRepeat table for tandem repeats; the raw data is here (.txt.gz):

Or through mysql:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -AB \
-e "SELECT chrom, chromStart, chromEnd from simpleRepeat;" hg19 \
> simpleRepeats.bed
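
Once you have simpleRepeats.bed, the filtering itself could look something like the Python sketch below. This is only an illustration, not anything GASV or UCSC provides: the function names `load_bed` and `overlaps_repeat` and the `(chrom, start, end)` variant tuples are made up here, and I'm assuming both the repeats and the variant calls use 0-based, half-open coordinates as in BED.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse (chrom, start, end) lines into sorted, merged intervals per chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue  # skip blank, short, or comment lines
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # overlapping or touching: extend the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def overlaps_repeat(repeats, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any repeat on chrom."""
    intervals = repeats.get(chrom, [])
    # index of the first merged interval that starts at or after `end`
    i = bisect.bisect_left(intervals, (end,))
    # intervals are disjoint and sorted, so only the one just before i can overlap
    return i > 0 and intervals[i - 1][1] > start


# Hypothetical example data; in practice: with open("simpleRepeats.bed") as fh: load_bed(fh)
bed_lines = ["chr1\t10000\t10468", "chr1\t10627\t10800", "chr1\t10757\t10997"]
repeats = load_bed(bed_lines)
variants = [("chr1", 10400, 10500), ("chr1", 10468, 10600), ("chr2", 10000, 10468)]
kept = [v for v in variants if not overlaps_repeat(repeats, *v)]
print(kept)  # [('chr1', 10468, 10600), ('chr2', 10000, 10468)]
```

Merging the repeats first keeps the lookup to a single binary search per variant. For real whole-genome call sets a dedicated interval tool would likely be more convenient than hand-rolled code.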


You could also have a look at the mappability tables. Their description is:

These tracks display the level of sequence uniqueness of the reference GRCh37/hg19 genome assembly. They were generated using different window sizes, and high signal will be found in areas where the sequence is unique.

0

I can get the repeat sequences from this table, but I don't know how to get the column headers for the file:

585 chr1    10000   10468   trf 6   77.2    6   95  3   789 33  51  0   15  1.43    TAACCC
585 chr1    10627   10800   trf 29  6   29  100 0   346 13  38  47  0   1.43    AGGCGCGCCGCGCCGGCGCAGGCGCAGAG
585 chr1    10757   10997   trf 76  3.2 76  95  2   434 17  30  45  6   1.73    GGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGT
585 chr1    11225   11447   trf 117 1.9 121 80  14  273 12  32  33  20  1.9 CGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGTCCTCTTGCTCAAGGTGAGTGGCAGACGCCCACCTGCTGGCAGCCGGGGACACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCA
585 chr1    11271   11448   trf 61  2.9 61  82  4   187 12  32  34  20  1.9 AGTGGTGGCACGCCACCTGCTGGCAGCTAGGGACACTGCAGGGCCCTCTTGCTCAAGGTAT
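
The raw .txt.gz dumps carry no header line; the column names live in the table schema. Running the mysql command without the -N flag (which suppresses column names) should print them. As far as I know, the hg19 simpleRepeat table has the 17 columns listed in the sketch below, which matches the 17 fields in the rows above, but it is worth checking the list against the schema before relying on it. The helper name `parse_simple_repeat` is made up for illustration.

```python
# Column names as I understand the hg19 simpleRepeat schema (an assumption; verify
# against the table schema before relying on it).
SIMPLE_REPEAT_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "name", "period", "copyNum",
    "consensusSize", "perMatch", "perIndel", "score", "A", "C", "G", "T",
    "entropy", "sequence",
]


def parse_simple_repeat(line):
    """Turn one whitespace-separated simpleRepeat row into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(SIMPLE_REPEAT_COLUMNS):
        raise ValueError(
            f"expected {len(SIMPLE_REPEAT_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(SIMPLE_REPEAT_COLUMNS, fields))


# First row from the comment above
row = parse_simple_repeat(
    "585 chr1 10000 10468 trf 6 77.2 6 95 3 789 33 51 0 15 1.43 TAACCC"
)
print(row["chrom"], row["chromStart"], row["chromEnd"], row["period"], row["sequence"])
```

All values are left as strings here; convert the numeric columns as needed.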

0

Is the database 0-based or 1-based?
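
For what it's worth, UCSC database tables store chromStart as 0-based and chromEnd as exclusive (half-open, the same convention as BED), whereas positions shown in the Genome Browser web display are 1-based. A small sketch of the conversion, with made-up helper names:

```python
def ucsc_to_one_based(chrom_start, chrom_end):
    """Convert a 0-based half-open interval to a 1-based closed interval."""
    # Only the start shifts; the exclusive end equals the inclusive 1-based end.
    return chrom_start + 1, chrom_end


def interval_length(chrom_start, chrom_end):
    """Length in bases of a 0-based half-open interval."""
    return chrom_end - chrom_start


# First simpleRepeat row from the comment above: chr1 10000 10468
print(ucsc_to_one_based(10000, 10468))  # (10001, 10468)
print(interval_length(10000, 10468))    # 468
```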

4
11.3 years ago

Eric's answer is fine if you wish to use a public source for the filtering. If you wish to do this in-house, then grab a library of human repeats; the RepBase data would be best here.

Alastair also provides key points to accomplish this task.

0
4.1 years ago
JJ Gao ▴ 50

I was looking for something similar and found the Duplicated Genes Database: http://dgd.genouest.org/... in case this is useful for others.

0
4.1 years ago

I'm not sure if this is a tangent that should really be a separate thread for discussion, but I noticed that RepBase is having to change its method of support.

Given that I would use RepBase with the command-line version of RepeatMasker, I am not sure how this affects things (and, for a .bed or .gtf track, I would download the table from UCSC, as recommended in other responses). However, if you want to understand the repeat references and annotations better, it may be worth looking at the sequences in RepBase.