Question

Where Can I Find A Good Database Of The Repeat Regions Of The Human Genome?

7

12.6 years ago

Stephwen ▴ 160

Hello everyone,

I'm currently using GASV ( http://code.google.com/p/gasv/ ) to find structural variants in human whole genome data.

To remove variants that I consider irrelevant, I want to filter out those located in highly repetitive regions such as centromeres and telomeres, as well as other repeat regions of the genome.

Therefore, I'm looking for a database of such repeat regions.

Thanks for your help.

repeats database human genome • 14k views

2

My answer in this thread should be what you need.

updated 4.4 years ago by Ram 43k • written 12.6 years ago by Alastair Kerr 5.3k

score 6 · Answer 1 · 2011-08-31

6

12.6 years ago

Eric Fournier ★ 1.4k

If your data is in genomic coordinates, you could use the UCSC Genome Browser's Table Browser to extract repeat element annotations from the RepeatMasker track.

If you have sequences, you could use RepeatMasker and RepBase to determine which parts of your sequences are repetitive in nature.
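
For the genomic-coordinate case, the filtering step could look roughly like the sketch below. It is only an illustration under stated assumptions: the repeats are assumed to have been exported as 0-based, half-open (chrom, start, end) intervals, the function names `build_repeat_index` and `overlaps_repeat` are hypothetical, and the variant intervals are made up. It is not part of GASV or of any UCSC tool.

```python
import bisect
from collections import defaultdict


def build_repeat_index(repeats):
    """Group (chrom, start, end) repeats by chromosome and merge overlapping ones."""
    by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlaps (or touches) the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def overlaps_repeat(index, chrom, start, end):
    """Return True if the half-open interval [start, end) overlaps a repeat."""
    merged = index.get(chrom, [])
    # Position of the first merged interval whose start is >= end.
    i = bisect.bisect_left(merged, (end,))
    # Only the interval just before that position can overlap [start, end).
    return i > 0 and merged[i - 1][1] > start


# Repeat coordinates taken from the simpleRepeat rows quoted in this thread.
repeats = [("chr1", 10000, 10468), ("chr1", 10627, 10800), ("chr1", 10757, 10997)]
index = build_repeat_index(repeats)

# Hypothetical variant intervals, for illustration only.
variants = [("chr1", 10400, 10500), ("chr1", 20000, 20100), ("chr2", 100, 200)]
kept = [v for v in variants if not overlaps_repeat(index, *v)]
print(kept)
```

For whole-genome data, an interval tool such as bedtools would normally be used instead; the sketch only shows the logic of the overlap test.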

score 5 · Answer 2 · 2011-08-31

5

12.6 years ago

brentp 24k

UCSC has a simpleRepeat table for tandem repeats; the raw data is here (.txt.gz):

Or through mysql:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -AB \
-e "SELECT chrom, chromStart, chromEnd from simpleRepeat;" hg19 \
> simpleRepeats.bed

You could also have a look at the mappability tables. Their description is:

These tracks display the level of sequence uniqueness of the reference GRCh37/hg19 genome assembly. They were generated using different window sizes, and high signal will be found in areas where the sequence is unique.

0

The database gives me the sequences, but I don't know how to get the column headers for the file:

585 chr1    10000   10468   trf 6   77.2    6   95  3   789 33  51  0   15  1.43    TAACCC
585 chr1    10627   10800   trf 29  6   29  100 0   346 13  38  47  0   1.43    AGGCGCGCCGCGCCGGCGCAGGCGCAGAG
585 chr1    10757   10997   trf 76  3.2 76  95  2   434 17  30  45  6   1.73    GGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGT
585 chr1    11225   11447   trf 117 1.9 121 80  14  273 12  32  33  20  1.9 CGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGTCCTCTTGCTCAAGGTGAGTGGCAGACGCCCACCTGCTGGCAGCCGGGGACACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCA
585 chr1    11271   11448   trf 61  2.9 61  82  4   187 12  32  34  20  1.9 AGTGGTGGCACGCCACCTGCTGGCAGCTAGGGACACTGCAGGGCCCTCTTGCTCAAGGTAT

updated 4.4 years ago by Ram 43k • written 8.3 years ago by shrinka.genetics ▴ 40

0

Please see https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=rep&hgta_track=simpleRepeat&hgta_table=simpleRepeat&hgta_doSchema=describe+table+schema
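
As a rough sketch of how those headers could be attached to the rows, the snippet below labels each field with a column name. The column list is my reading of the simpleRepeat schema and should be checked against the schema page linked above; the function name `parse_simple_repeat` is hypothetical. The real download is tab-separated, but the rows quoted in this thread are space-separated, so the sketch splits on any whitespace.

```python
import io

# Column names as I understand the simpleRepeat schema (verify against the schema page).
SIMPLE_REPEAT_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "name", "period", "copyNum",
    "consensusSize", "perMatch", "perIndel", "score", "A", "C", "G", "T",
    "entropy", "sequence",
]


def parse_simple_repeat(handle):
    """Yield one dict per row, keyed by column name (all values kept as strings)."""
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(SIMPLE_REPEAT_COLUMNS):
            raise ValueError(f"expected {len(SIMPLE_REPEAT_COLUMNS)} fields, got {len(fields)}")
        yield dict(zip(SIMPLE_REPEAT_COLUMNS, fields))


# Two of the rows quoted above, used as sample input.
sample = io.StringIO(
    "585 chr1 10000 10468 trf 6 77.2 6 95 3 789 33 51 0 15 1.43 TAACCC\n"
    "585 chr1 10627 10800 trf 29 6 29 100 0 346 13 38 47 0 1.43 AGGCGCGCCGCGCCGGCGCAGGCGCAGAG\n"
)
rows = list(parse_simple_repeat(sample))
print(rows[0]["chrom"], rows[0]["chromStart"], rows[0]["chromEnd"], rows[0]["sequence"])
```

To read the downloaded file instead of the sample, the same function could be passed a handle from `gzip.open(path, "rt")`.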

2.8 years ago by YexianZhang • 0

0

Is the database 0-based or 1-based?

6.1 years ago by Chen Sun ★ 1.1k

0

See: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
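
In short, as that post describes, UCSC tables store positions as 0-based, half-open intervals, while the Genome Browser display uses 1-based, fully closed positions. A minimal sketch of the conversion follows; the function names are hypothetical and chosen only for this example.

```python
def ucsc_table_to_one_based(chrom_start, chrom_end):
    """Convert a 0-based, half-open interval to 1-based, fully closed."""
    return chrom_start + 1, chrom_end


def one_based_to_ucsc_table(start, end):
    """Convert a 1-based, fully closed interval to 0-based, half-open."""
    return start - 1, end


# First simpleRepeat row quoted in this thread: chromStart=10000, chromEnd=10468.
print(ucsc_table_to_one_based(10000, 10468))  # → (10001, 10468)
```

Either way the interval covers the same 468 bases: 10468 - 10000 in table coordinates, and 10468 - 10001 + 1 in 1-based coordinates.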

6.1 years ago by GenoMax 141k

score 4 · Answer 3 · 2011-08-31

4

12.6 years ago

Larry_Parnell 16k

Eric's answer is fine if you wish to use a public source to do the filtering. If you wish to do this in-house, then grab the library of human repeats; for that, the RepBase data would be best.

Alastair also provides key points to accomplish this task.

score 0 · Answer 4 · 2018-10-30

0

5.5 years ago

JJ Gao ▴ 50

I was looking for something similar and found Duplicated Genes Database: http://dgd.genouest.org/... in case this is useful for others.

score 0 · Answer 5 · 2018-10-30

I'm not sure if this is a tangent that should really be a separate thread, but I noticed that RepBase is having to change its method of support.

Since I would use RepBase for the command-line version of RepeatMasker, I am not sure how this affects things (for a .bed or .gtf track, I would download the table from UCSC, as recommended in other responses). However, if you want to learn more about the repeat references and annotations, you might want to look at the sequences in RepBase.