Question: How to extract all the simple repeats from the hg19 reference genome
Jackie (40), United States, wrote 9 weeks ago:

I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, and tetranucleotide) in the human genome (hg19). I downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, the mononucleotide repeats chr1:981861-981868 [CCCCCCCC] and chr1:1116223-1116230 [GGGGGGGG] are not in the UCSC list. I then tried to generate a list with TRF, but with the default parameters TRF also failed to report some of these repeats, e.g., chr1:981861-981868 [CCCCCCCC]. Can someone provide some insight here:

  1. Is there any place where I can download a really 'comprehensive' simple repeats list from?
  2. If the answer to question 1 is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
  3. If you would suggest TRF, how should I make it report the mononucleotide repeats that I was missing with the default parameters?


modified 9 weeks ago by Alex Reynolds (21k) • written 9 weeks ago by Jackie (40)

Have you looked at the UCSC Table Browser? Check the group "Repeats". It has several tracks, and you can download the data for each of them.

written 9 weeks ago by genomax (34k)
Pierre Lindenbaum (99k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 9 weeks ago:

The following C program will find the simple repeats:


gcc biostar267241.c


curl -s "" | gunzip -c | ./a.out | grep -E '(981861|1116223)'
chr1    981861  981868  C[7]
chr1    1116223 1116230 G[7]
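
For readers who just want to see the logic, here is a rough Python sketch of the same idea (runs of a single base longer than 5). It is not the C program above; the function names and the toy FASTA record are made up for illustration. Coordinates are 0-based and half-open (BED style), which may not match the numbering in the output above.

```python
import io
import re


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def homopolymer_runs(seq, min_len=6):
    """Yield (start, end, base, length) for runs of one base.

    Only A/C/G/T runs are reported, so stretches of N are skipped.
    The sequence is upper-cased first, so soft-masked bases count too.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    for m in pattern.finditer(seq.upper()):
        yield m.start(), m.end(), m.group(1), m.end() - m.start()


# Toy input (not real hg19 sequence); the G run spans a line break.
demo = io.StringIO(">chrDemo\nACGTCCCCCCCCAG\nGGGGGGTA\n")
for chrom, seq in read_fasta(demo):
    for start, end, base, n in homopolymer_runs(seq):
        print(f"{chrom}\t{start}\t{end}\t{base}[{n}]")
```

To run this on a real genome you would pass an open FASTA file (or `sys.stdin`) to `read_fasta` instead of the toy record. Each chromosome is held in memory as one string, which is simple but not frugal.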
written 9 weeks ago by Pierre Lindenbaum (99k)

Thank you so much for posting the C program. It seems to work perfectly, but I have another question: does this program find only mononucleotide repeats, or any repeat with a total length > 5 bp?

written 9 weeks ago by Jackie (40)

Only mononucleotide repeats (runs of the same base) of length > 5.

written 9 weeks ago by Pierre Lindenbaum (99k)

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 9 weeks ago by WouterDeCoster (22k)
Alex Reynolds (21k), Seattle, WA USA, wrote 9 weeks ago:

You could download repeats for hg19 from the RepeatMasker folks and convert them to BED with convert2bed, so that you can do set operations on them:

$ wget -qO- | gunzip -c - | convert2bed --input=rmsk - > hg19.fa.out.bed

You could do ad-hoc searches with bedops, piping in your region of interest:

$ echo -e 'chr1\t981861\t981868' | bedops -e 1 hg19.fa.out.bed -

Or pass in a file of regions of interest:

$ bedops -e 1 hg19.fa.out.bed roi.bed > answer.bed

Perhaps you could use this with Pierre's binary to construct results with simple repeats and more complex repeat hits.
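
If it helps to see what the `bedops -e 1` step is doing, here is a minimal Python sketch of the same filter: keep each row of the first file that overlaps any region of interest by at least 1 bp. This is only an illustration with made-up rows and hypothetical function names, not how bedops is implemented.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two 0-based, half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def element_of(reference, regions, min_bp=1):
    """Keep reference rows that overlap any region by at least min_bp bases.

    Naive O(n*m) scan for clarity; bedops itself works on sorted input.
    """
    kept = []
    for row in reference:
        chrom, start, end = row[0], row[1], row[2]
        for r_chrom, r_start, r_end in regions:
            if chrom == r_chrom and overlap_bp(start, end, r_start, r_end) >= min_bp:
                kept.append(row)
                break
    return kept


# Made-up rows standing in for hg19.fa.out.bed (not real annotations).
reference = [
    ("chr1", 981850, 981870, "toy_repeat_A"),  # covers the region
    ("chr1", 981868, 981900, "toy_repeat_B"),  # only touches its end
    ("chr2", 981861, 981868, "toy_repeat_C"),  # different chromosome
]
regions = [("chr1", 981861, 981868)]

for row in element_of(reference, regions):
    print("\t".join(map(str, row)))
```

Because BED intervals are half-open, a row that starts exactly where the region ends shares no bases with it and is not reported.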

written 9 weeks ago by Alex Reynolds (21k)

Thank you, Alex, that's a great resource. I have downloaded the RepeatMasker (RM) file, and I think combining it with the list generated using Pierre's code will give a good starting list.

However, I am still trying to understand why even this RM file is missing some simple repeats, e.g., the trinucleotide repeat chr1:6680069-6680085 [GAA]n. For those of you who understand RM well: are there criteria a simple repeat must meet to be included in the final RM list, e.g., a copy number of the unit >= 10, or something like that? I ask because most of the longer repeats are present in the RM file.

written 9 weeks ago by Jackie (40)

It's unclear to me what parameters were used to generate these files. The best people to ask would probably be the RepeatMasker folks.
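
In the meantime, one way to avoid depending on another tool's thresholds is to scan for perfect repeats yourself. Below is a minimal Python sketch that reports exact tandem repeats with units of 1 to 4 bp. The minimum copy numbers are arbitrary assumptions to tune, the names are hypothetical, and this is not how RepeatMasker or TRF select repeats: it only finds exact copies, ignores trailing partial copies, and uses 0-based, half-open coordinates.

```python
import re

# Assumed thresholds: minimum number of full copies for each unit length.
MIN_COPIES = {1: 6, 2: 4, 3: 4, 4: 3}


def build_pattern(min_copies=MIN_COPIES):
    """One alternative per unit length, shortest unit tried first."""
    parts = []
    for group, (unit_len, copies) in enumerate(sorted(min_copies.items()), start=1):
        # ([ACGT]{k}) captures the unit; \group{n,} requires n more copies.
        parts.append(r"([ACGT]{%d})\%d{%d,}" % (unit_len, group, copies - 1))
    return re.compile("|".join(parts))


def perfect_tandem_repeats(seq, pattern=None):
    """Yield (start, end, unit, copies) for exact tandem repeats in seq."""
    pattern = pattern or build_pattern()
    for m in pattern.finditer(seq.upper()):
        unit = m.group(m.lastindex)  # the one capture group that matched
        copies = (m.end() - m.start()) // len(unit)
        yield m.start(), m.end(), unit, copies


# Toy sequence (not real hg19): five GAA copies plus a partial one, then a C run.
toy = "TT" + "GAA" * 5 + "GA" + "C" * 8 + "T"
for start, end, unit, copies in perfect_tandem_repeats(toy):
    print(f"{start}\t{end}\t({unit})n\t{copies}")
```

Trying the shortest unit first means a run like ATATATAT is reported as (AT)n rather than (ATAT)n. Matches do not overlap, so a repeat that begins inside a match reported earlier is not reported separately.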

written 9 weeks ago by Alex Reynolds (21k)