How to extract all the simple repeats from the hg19 reference genome
6.7 years ago
Jackie ▴ 70

I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, and tetranucleotide) in the human genome (hg19). I downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, the mononucleotide repeats chr1:981861-981868 [CCCCCCCC] and chr1:1116223-1116230 [GGGGGGGG] are not in the UCSC list. I then tried to generate a list with TRF, but with the default parameters some of these repeats, e.g. chr1:981861-981868 [CCCCCCCC], were still not reported. Can someone provide some insight here:

  1. Is there anywhere I can download a truly 'comprehensive' list of simple repeats?
  2. If the answer to question #1 is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
  3. If you would suggest TRF, how can I make it report the mononucleotide repeats that I am missing with the default parameters?

Thanks
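
A possible explanation for the TRF behaviour described above, sketched as simple arithmetic. This assumes the run used the commonly recommended TRF settings (Match=2, Minscore=50) and that a perfect repeat scores roughly its length times the match weight; both are assumptions about the run, not something stated in the question. The helper name `min_perfect_length` is hypothetical.

```python
import math


def min_perfect_length(min_score: int, match_weight: int) -> int:
    """Shortest perfect repeat (no mismatches, no indels) that can reach min_score."""
    return math.ceil(min_score / match_weight)


# Assumed settings: Match=2, Minscore=50.
print(min_perfect_length(50, 2))  # 25

# An 8-bp run such as CCCCCCCC can score at most 8 * 2 = 16, below the threshold.
print(8 * 2 >= 50)  # False
```

Under these assumptions a run of 8 identical bases is far too short to be reported, so Minscore would have to be lowered considerably. Whether TRF then reports such short runs reliably is worth checking on a small test sequence first.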

simple repeats reference genome • 3.8k views

Have you looked at the UCSC Table Browser? Check the "Repeats" group. It offers several options whose data you can download.

6.7 years ago

The following C program will find the simple repeats:

Compile:

gcc biostar267241.c

Example:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | ./a.out | grep -E '(981861|1116223)'
chr1    981861  981868  C[7]
chr1    1116223 1116230 G[7]
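
The C source itself is not reproduced on this page, so here is a minimal Python sketch of the same idea: stream a FASTA file and report runs of a single base. It is not Pierre's program. The function name `homopolymer_runs`, the 1-based inclusive coordinates, and the reported length are assumptions, so its output format may differ from the output shown above. The default minimum length of 6 follows the "len > 5" criterion mentioned in this thread.

```python
from typing import Iterable, Iterator, Tuple


def homopolymer_runs(
    lines: Iterable[str], min_len: int = 6
) -> Iterator[Tuple[str, int, int, str, int]]:
    """Yield (chrom, start, end, base, length), 1-based inclusive coordinates."""
    chrom = ""
    pos = 0        # 1-based position of the last base read on this sequence
    base = ""      # base of the run currently being extended
    run_start = 0
    run_len = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # New sequence: flush the pending run, then reset the state.
            if run_len >= min_len and base != "N":
                yield chrom, run_start, pos, base, run_len
            chrom = line[1:].split()[0] if len(line) > 1 else ""
            pos, base, run_len = 0, "", 0
            continue
        for c in line.upper():  # soft-masked (lowercase) bases are treated alike
            pos += 1
            if c == base:
                run_len += 1
            else:
                if run_len >= min_len and base != "N":
                    yield chrom, run_start, pos - 1, base, run_len
                base, run_start, run_len = c, pos, 1
    if run_len >= min_len and base != "N":
        yield chrom, run_start, pos, base, run_len


# Toy input (made up for illustration); a run may span several FASTA lines.
demo = [">chr1 test", "ACCCCC", "CCCGT", ">chr2", "GGGGGGA"]
for rec in homopolymer_runs(demo):
    print("\t".join(map(str, rec)))
```

Because the function accepts any iterable of lines, passing `sys.stdin` instead of `demo` would let it read from a `gunzip -c` pipe like the one above.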

Thank you so much for posting the C program. It seems to work perfectly, but I have another question. Does this program find only mononucleotide repeats, or any repeat with a total length > 5 bp?


Only mononucleotide repeats (runs of the same base) of length > 5.
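
Since the original question also asks about di-, tri-, and tetranucleotide repeats, the same scan can be generalised to longer units. The following is a rough sketch only, not part of the posted program: it finds perfect (mismatch-free) repeats, and the function name `perfect_repeats` and the thresholds `min_len=6` and `min_copies=2` are assumptions to adjust as needed.

```python
from typing import Iterator, Tuple


def perfect_repeats(
    seq: str, max_unit: int = 4, min_len: int = 6, min_copies: int = 2
) -> Iterator[Tuple[int, int, str, int]]:
    """Yield (start, end, unit, full_copies), 1-based inclusive coordinates."""
    seq = seq.upper()
    n = len(seq)
    for k in range(1, max_unit + 1):       # k = length of the repeat unit
        i = 0
        while i + k <= n:
            j = i + k
            # Extend while each base equals the base k positions earlier.
            while j < n and seq[j] == seq[j - k]:
                j += 1
            unit = seq[i:i + k]
            length = j - i                 # may end with a partial copy
            copies = length // k
            # Skip units that are themselves repeats of a shorter unit (e.g. "CC").
            primitive = unit not in (unit + unit)[1:-1]
            if (copies >= min_copies and length >= min_len
                    and primitive and set(unit) <= set("ACGT")):
                yield i + 1, j, unit, copies
            i = j - k + 1                  # first position that can start a new stretch


# Toy sequence (made up): a GAA repeat followed by a run of eight C's.
for rec in sorted(perfect_repeats("TTGAAGAAGAAGATCCCCCCCCA")):
    print(rec)
```

A pure-Python loop like this would be slow on a whole chromosome; it is meant to show the logic rather than to replace a compiled tool.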

6.7 years ago

You could download repeats for hg19 from the RepeatMasker folks and convert them to BED with convert2bed in order to do set operations:

$ wget -qO- http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db20140131/hg19.fa.out.gz | gunzip -c - | convert2bed --input=rmsk - > hg19.fa.out.bed

You could do ad-hoc searches with bedops, piping in your region of interest:

$ echo -e 'chr1\t981861\t981868' | bedops -e 1 hg19.fa.out.bed -

Or pass in a file of regions of interest:

$ bedops -e 1 hg19.fa.out.bed roi.bed > answer.bed
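
To make the `-e 1` (element-of, at least 1 bp of overlap) operation concrete, here is a small Python sketch of the filtering it performs: keep each element of the first file that overlaps any region of interest by at least one base. This is a naive illustration on made-up intervals, assuming BED-style 0-based half-open coordinates; the function name `elements_overlapping` is hypothetical, and bedops itself works on sorted input far more efficiently.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def elements_overlapping(
    reference: List[Interval], query: List[Interval], min_overlap: int = 1
) -> List[Interval]:
    """Return reference intervals overlapping any query interval by >= min_overlap bases."""
    hits = []
    for chrom, start, end in reference:
        for q_chrom, q_start, q_end in query:
            overlap = min(end, q_end) - max(start, q_start)
            if chrom == q_chrom and overlap >= min_overlap:
                hits.append((chrom, start, end))
                break  # report each reference element once
    return hits


# Made-up annotation intervals and one region of interest.
annotations = [("chr1", 1000, 2000), ("chr1", 981850, 981870), ("chr2", 981860, 981870)]
roi = [("chr1", 981861, 981868)]
print(elements_overlapping(annotations, roi))
```

An empty result for a region, as in the question, means no annotated element touches it at all, rather than that the overlap was too small.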

Perhaps you could combine this with the output of Pierre's program to build a list containing both simple repeats and more complex repeat hits.


Thank you, Alex, that's a great resource. I have downloaded the RepeatMasker (RM) file, and I think combining it with the list generated by Pierre's code will give a good starting list.

However, I am still trying to understand why even this RM file is missing some simple repeats, e.g. the trinucleotide repeat chr1:6680069-6680085 [GAA]n. For those of you who understand RM well: are there criteria a simple repeat must meet to be included in the final RM list, e.g. a unit copy number >= 10 or something like that? I ask because most of the longer repeats are present in the RM file.


It's unclear to me what parameters were used to generate these files. The best people to ask would probably be the RepeatMasker folks.
