Question

How to extract all the simple repeats from the hg19 reference genome

6.7 years ago

Jackie ▴ 70

I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, and tetranucleotide) in the human genome (hg19). I downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, chr1:981861-981868[CCCCCCCC] and chr1:1116223-1116230[GGGGGGGG] are mononucleotide repeats we want to look at, but they are not in the UCSC list. I then tried to generate a list using TRF, but with the default parameters TRF still did not report some of these repeats, e.g., chr1:981861-981868[CCCCCCCC]. Can someone provide some insight here?

1. Is there anywhere I can download a truly 'comprehensive' list of simple repeats?
2. If the answer to question #1 is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
3. If you would suggest TRF, how should I make it report the mononucleotide repeats that I was missing with the default parameters?

Thanks

simple repeats reference genome • 3.8k views

updated 6.7 years ago by Alex Reynolds 35k • written 6.7 years ago by Jackie ▴ 70

Have you looked at the UCSC Table Browser? Check the "Repeats" group. There are multiple options there for which you can download the data.

Reply • 6.7 years ago by GenoMax 141k

score 3 · Accepted Answer · 2017-08-11

6.7 years ago

Pierre Lindenbaum 161k

The following C program will find the simple repeats:

Compile:

gcc biostar267241.c

Example:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | ./a.out | grep -E '(981861|1116223)'
chr1    981861  981868  C[7]
chr1    1116223 1116230 G[7]
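
The C source itself is not reproduced in this thread. As a rough illustration only (not Pierre's program), here is a minimal Python sketch of the same idea: read FASTA records and report every run of a single base that is at least 6 bases long (length > 5). The function names `read_fasta` and `homopolymer_runs`, the `min_len` parameter, the choice to skip runs of N, and the 0-based, half-open coordinates are all assumptions of this sketch, so its output may not match the coordinates shown above.

```python
import io
from itertools import groupby


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def homopolymer_runs(seq, min_len=6):
    """Yield (start, end, base, length) for runs of one base >= min_len.

    Coordinates are 0-based, half-open (an assumption of this sketch).
    Soft-masked (lowercase) bases are upper-cased; runs of N are skipped.
    """
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len and base != "N":
            yield pos, pos + length, base, length
        pos += length


# Demo on an in-memory FASTA record (toy sequence, not real hg19 data).
demo = io.StringIO(">chrDemo\nACGTCCCCCCCCAT\nGGGGGGGA\n")
for name, seq in read_fasta(demo):
    for start, end, base, length in homopolymer_runs(seq):
        print(f"{name}\t{start}\t{end}\t{base}[{length}]")
# prints (tab-separated):
# chrDemo  4   12  C[8]
# chrDemo  14  21  G[7]
```

To run it on a real chromosome, you could pass `sys.stdin` to `read_fasta` instead of the demo handle and pipe the decompressed FASTA into the script, as in the curl example. Expect it to be much slower than compiled C on a full chromosome.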

Thank you so much for posting the C program. It seems to work perfectly, but I have another question: does this program find only mononucleotide repeats, or any repeat with a total length > 5 bp?

Reply • 6.7 years ago by Jackie ▴ 70

Only mononucleotide repeats (runs of the same base) with length > 5.
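
Since the question also asks about di-, tri-, and tetranucleotide repeats, the same approach can be extended to longer units. The following is a minimal Python sketch, not part of the C program discussed here, that uses a regular-expression backreference to find perfect tandem repeats with unit sizes 1 to 4. The name `simple_repeats` and the parameters `max_unit`, `min_len`, and `min_copies` are hypothetical; coordinates are 0-based, half-open. It counts only whole, exact copies, so interrupted repeats and trailing partial units are not reported.

```python
import re


def simple_repeats(seq, max_unit=4, min_len=6, min_copies=2):
    """Find perfect tandem repeats with unit sizes 1..max_unit.

    Returns a sorted list of (start, end, unit, copies) tuples with
    0-based, half-open coordinates (an assumption of this sketch).
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # A k-base unit followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "CC" or "ACAC"); the smaller unit size reports them.
            if any(k % d == 0 and unit == unit[:d] * (k // d)
                   for d in range(1, k)):
                continue
            length = m.end() - m.start()
            if length >= min_len:
                hits.append((m.start(), m.end(), unit, length // k))
    return sorted(hits)


# Toy sequence (not real hg19 data): (GAA)x5, then C x8, then (AC)x4.
toy = "TT" + "GAA" * 5 + "TT" + "C" * 8 + "TT" + "AC" * 4 + "TT"
for start, end, unit, copies in simple_repeats(toy):
    print(start, end, f"{unit}[{copies}]")
# prints:
# 2 17 GAA[5]
# 19 27 C[8]
# 29 37 AC[4]
```

With `min_len=6`, a trinucleotide repeat needs only two copies to be reported, so raising `min_copies` or `min_len` is the obvious knob if the list gets too noisy.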

Reply • 6.7 years ago by Pierre Lindenbaum 161k

If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.

Reply • 6.7 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2017-08-11

6.7 years ago

Alex Reynolds 35k

You could download repeats for hg19 from the RepeatMasker folks and convert them to BED with convert2bed to do set operations:

$ wget -qO- http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db20140131/hg19.fa.out.gz | gunzip -c - | convert2bed --input=rmsk - > hg19.fa.out.bed

You could do ad-hoc searches with bedops, piping in your region of interest:

$ echo -e 'chr1\t981861\t981868' | bedops -e 1 hg19.fa.out.bed -

Or pass in a file of regions of interest:

$ bedops -e 1 hg19.fa.out.bed roi.bed > answer.bed

Perhaps you could combine this with the output of Pierre's program to build a result set containing both simple repeats and more complex repeat hits.
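
To illustrate the kind of set operation involved, here is a minimal Python sketch that splits a list of intervals (for example, runs found by a repeat scanner) into those that overlap at least one annotation by 1 base or more, which is the criterion `bedops -e 1` applies, and those that overlap none. The function name `split_by_overlap` and the toy intervals are made up for this example. The nested loop is for clarity only and would be far too slow for genome-scale files, where bedops itself is the better tool.

```python
def split_by_overlap(elements, annotations, min_overlap=1):
    """Split elements into (covered, novel) relative to annotations.

    Each interval is a (chrom, start, end) tuple with 0-based, half-open
    coordinates, as in BED. An element is "covered" if it shares at least
    min_overlap bases with any annotation on the same chromosome.
    """
    covered, novel = [], []
    for chrom, start, end in elements:
        hit = any(
            chrom == a_chrom
            and min(end, a_end) - max(start, a_start) >= min_overlap
            for a_chrom, a_start, a_end in annotations
        )
        (covered if hit else novel).append((chrom, start, end))
    return covered, novel


# Toy intervals, invented for illustration (not real hg19 annotations).
runs = [("chr1", 100, 108), ("chr1", 500, 507)]
annotated = [("chr1", 90, 104), ("chr2", 500, 600)]

covered, novel = split_by_overlap(runs, annotated)
print("covered:", covered)  # covered: [('chr1', 100, 108)]
print("novel:", novel)      # novel: [('chr1', 500, 507)]
```

The "novel" list corresponds to repeats that the annotation file does not cover, which is the set you would add to build a combined list.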

Thank you, Alex, that's a great resource. I have downloaded the RepeatMasker (RM) file, and I think combining the list generated using Pierre's code with this file will give a good starting list.

However, I am still trying to understand why even this RM file is missing some simple repeats, e.g., the trinucleotide repeat chr1:6680069-6680085 [GAA]n. For those of you who understand RM well, are there criteria a simple repeat must meet to be included in the final RM list, e.g., a unit copy number >= 10 or something like that? I ask because most of the longer repeats are present in the RM file.

Reply • 6.7 years ago by Jackie ▴ 70

It's unclear to me what parameters were used to generate these files. The best people to ask would probably be the RepeatMasker folks.

Reply • 6.7 years ago by Alex Reynolds 35k