Question: Where Can I Find A Good Database Of The Repeat Regions Of The Human Genome?
8.6 years ago by
Stephwen140
Belgium
Stephwen140 wrote:

Hello everyone,

I'm currently using GASV ( http://code.google.com/p/gasv/ ) to find structural variants in human whole genome data.

To remove variants that I consider irrelevant, I want to filter out those located in highly repetitive regions such as centromeres and telomeres, as well as other repeat regions of the genome.

Therefore, I'm looking for a database of such repeat regions.

Thanks for your help.

genome repeats database human • 10k views
modified 17 months ago by Charles Warden7.6k • written 8.6 years ago by Stephwen140

My answer in this thread should be what you need.

modified 4 months ago by RamRS26k • written 8.6 years ago by Alastair Kerr5.2k
8.6 years ago by
Eric Fournier1.4k
Quebec, Canada
Eric Fournier1.4k wrote:

If your data are in genomic coordinates, you could use the Table Browser tool of the UCSC Genome Browser to extract repeat element information from the RepeatMasker track.

If you have sequences, you could use RepeatMasker and RepBase to determine which parts of your sequences are repetitive in nature.
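
For the first option, the same information can be pulled without the web interface. The following is only a rough sketch: it assumes the RepeatMasker track lives in a table called `rmsk` on the UCSC public MySQL server, that the columns are named `genoName`, `genoStart`, `genoEnd`, `repName` and `repClass` (as recalled from the hg19 schema), and that `genome-mysql.soe.ucsc.edu` is the current public host. The function name `fetch_rmsk_bed` is made up for this example. Check the table schema page before relying on any of it.

```shell
# Sketch: dump RepeatMasker annotations as a 5-column, BED-like table.
# Assumes the mysql client is installed and the public UCSC server is reachable.
# Table and column names are assumptions to verify against the UCSC schema.
fetch_rmsk_bed() {
    db="${1:-hg19}"   # assembly name, hg19 by default
    mysql --user=genome --host=genome-mysql.soe.ucsc.edu -N -AB \
        -e "SELECT genoName, genoStart, genoEnd, repName, repClass FROM rmsk;" \
        "$db"
}

# Usage (needs network access):
#   fetch_rmsk_bed hg19 > rmsk.hg19.bed
```

Keeping `repClass` in the output makes it possible to restrict the filter later to particular classes (for example satellites or simple repeats) instead of every annotated element.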

8.6 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

UCSC has a simpleRepeat table for tandem repeats; the raw data is here (.txt.gz):

Or through mysql:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -AB \
-e "SELECT chrom, chromStart, chromEnd from simpleRepeat;" hg19 \
> simpleRepeats.bed

You could also have a look at the mappability tables. Their description is:

These tracks display the level of sequence uniqueness of the reference GRCh37/hg19 genome assembly. They were generated using different window sizes, and high signal will be found in areas where the sequence is unique.
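
Once the repeats are in a BED file such as the simpleRepeats.bed written by the mysql command above, variant calls can be screened against it. Below is a minimal sketch, assuming the variants have already been converted to a tab-separated BED-like file (chrom, start, end, then any extra columns); GASV's native output would need converting first. The function name `filter_non_repeat` is hypothetical, and the overlap test is a naive all-against-all comparison per chromosome, so it is meant to illustrate the logic rather than to be fast.

```shell
# Print the variants that overlap no repeat interval.
# Both files: tab-separated, 0-based half-open coordinates (BED convention).
filter_non_repeat() {
    repeats="$1"
    variants="$2"
    awk -F'\t' '
        NR == FNR {                       # first file: store repeats per chromosome
            n[$1]++
            rs[$1, n[$1]] = $2 + 0        # repeat start
            re[$1, n[$1]] = $3 + 0        # repeat end (exclusive)
            next
        }
        {                                 # second file: test each variant
            hit = 0
            for (i = 1; i <= n[$1]; i++)
                if ($2 + 0 < re[$1, i] && $3 + 0 > rs[$1, i]) { hit = 1; break }
            if (!hit) print               # keep variants with no overlap
        }
    ' "$repeats" "$variants"
}

# Usage:
#   filter_non_repeat simpleRepeats.bed variants.bed > variants.filtered.bed
```

For real data, `bedtools intersect -v -a variants.bed -b simpleRepeats.bed` does the same job much more efficiently. Note that a large structural variant will nearly always span some repeat, so it may make more sense to test the breakpoint intervals than the whole event.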


The sequence can be read from the database, but I don't know how to get the header (column names) of the file:

585 chr1    10000   10468   trf 6   77.2    6   95  3   789 33  51  0   15  1.43    TAACCC                           
585 chr1    10627   10800   trf 29  6   29  100 0   346 13  38  47  0   1.43    AGGCGCGCCGCGCCGGCGCAGGCGCAGAG                
585 chr1    10757   10997   trf 76  3.2 76  95  2   434 17  30  45  6   1.73    GGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGT
585 chr1    11225   11447   trf 117 1.9 121 80  14  273 12  32  33  20  1.9 CGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGTCCTCTTGCTCAAGGTGAGTGGCAGACGCCCACCTGCTGGCAGCCGGGGACACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCA
585 chr1    11271   11448   trf 61  2.9 61  82  4   187 12  32  34  20  1.9 AGTGGTGGCACGCCACCTGCTGGCAGCTAGGGACACTGCAGGGCCCTCTTGCTCAAGGTAT
modified 4 months ago by RamRS26k • written 4.3 years ago by shrinka.genetics0
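
On the header question: the downloadable .txt.gz dumps carry no header line, but the mysql client prints the column names itself if the `-N` (skip column names) flag is dropped from the command quoted earlier in the thread. For a file that is already downloaded, a small sketch like the one below can prepend the names. The 17 column names are an assumption recalled from the UCSC simpleRepeat schema and should be checked against the schema page; the function name is made up. They do at least match the 17 fields in the rows above (for TAACCC, the A/C/G/T percentages 33/51/0/15 fit).

```shell
# Column names for simpleRepeat, as recalled from the UCSC schema (assumption).
SIMPLE_REPEAT_HEADER='bin chrom chromStart chromEnd name period copyNum consensusSize perMatch perIndel score A C G T entropy sequence'

# Read simpleRepeat rows on stdin, write them to stdout with a
# tab-separated header line in front.
add_simple_repeat_header() {
    printf '%s\n' "$SIMPLE_REPEAT_HEADER" | tr ' ' '\t'
    cat
}

# Usage:
#   gunzip -c simpleRepeat.txt.gz | add_simple_repeat_header > simpleRepeat.tsv
```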

Is the database 0-based or 1-based?

written 2.0 years ago by Chen990
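
On the coordinate question: as far as I know, UCSC database tables such as simpleRepeat store chromStart as 0-based and chromEnd as exclusive (the BED convention), whereas positions typed into the Genome Browser are 1-based and inclusive. A minimal sketch of the conversion, with a made-up function name:

```shell
# Convert BED-style rows (chrom, 0-based start, exclusive end) on stdin
# into 1-based, inclusive "chrom:start-end" strings.
bed_to_one_based() {
    awk -F'\t' '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# Usage:
#   printf 'chr1\t10000\t10468\n' | bed_to_one_based
#   # prints chr1:10001-10468
```

Only the start changes: adding 1 to a 0-based start gives the 1-based start, and an exclusive 0-based end is numerically the same as an inclusive 1-based end.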
8.6 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

Eric's answer is fine if you wish to use a public source to do the filtering. If you wish to do this in-house, then grab the library of human repeats; for this, the RepBase data would be best.

Alastair also provides key points to accomplish this task.

17 months ago by
JJ Gao50
United States/New York/MSKCC
JJ Gao50 wrote:

I was looking for something similar and found the Duplicated Genes Database: http://dgd.genouest.org/... in case this is useful for others.

17 months ago by
Charles Warden7.6k
Duarte, CA
Charles Warden7.6k wrote:

I'm not sure if this is a tangent that should really be a separate thread for discussion, but I noticed that RepBase is having to change its method of support.

Given that I would use RepBase for the command-line version of RepeatMasker, I am not sure how this affects things (and, in terms of having a .bed or .gtf track, I would download the table from UCSC, as recommended in other responses). However, if you wanted to learn more about the repeat references / annotations, you might want to learn more about the sequences in RepBase.
