Question: How to extract all the simple repeats from the hg19 reference genome
Jackie40 wrote:

I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, tetra-) in the human genome (hg19). I have downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, the mononucleotide repeats chr1:981861-981868 [CCCCCCCC] and chr1:1116223-1116230 [GGGGGGGG] are not in the UCSC list. I then tried to generate a list with TRF, but with the default parameters TRF still did not report some of the repeats I was looking for, e.g., chr1:981861-981868 [CCCCCCCC]. Can someone provide some insight here:

  1. Is there anywhere I can download a truly 'comprehensive' list of simple repeats?
  2. If the answer to question #1 is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
  3. If you would suggest TRF, how should I make it report the mononucleotide repeats that I was missing with the default parameters?

Thanks

modified 11 days ago by Alex Reynolds20k • written 11 days ago by Jackie40

Have you looked at the UCSC Table Browser? Check the "Repeats" group. Several options are available there, and you can download the data for each of them.

written 11 days ago by genomax32k
Pierre Lindenbaum96k wrote:

The following C program will find the simple repeats:

Compile:

gcc biostar267241.c

Example:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | ./a.out | grep -E '(981861|1116223)'
chr1    981861  981868  C[7]
chr1    1116223 1116230 G[7]
written 11 days ago by Pierre Lindenbaum96k

Thank you so much for posting the C program. It seems to work perfectly, but I have another question: does this program find only mono-repeats, or any repeat with a total length > 5 bp?

written 11 days ago by Jackie40

Only mono-repeats (runs of the same base) of length > 5.
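
To make that rule concrete, here is a minimal Python sketch of scanning a sequence for runs of a single base longer than 5. It is an illustration only, not the C program from this answer: the function name `homopolymer_runs`, the `min_len` parameter, the 0-based half-open coordinates, and the toy sequence are all assumptions made for the example, and it holds the whole sequence in memory rather than streaming a FASTA file.

```python
import re
from typing import Iterator, Tuple


def homopolymer_runs(seq: str, min_len: int = 6) -> Iterator[Tuple[int, int, str, int]]:
    """Yield (start, end, base, length) for runs of one base of at least min_len.

    Coordinates are 0-based, half-open (BED style). This convention is an
    assumption and may not match the output of the C program in this thread.
    Runs of N or other ambiguity codes are ignored.
    """
    # ([ACGT])\1{n,} matches one base followed by at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    for match in pattern.finditer(seq.upper()):
        yield match.start(), match.end(), match.group(1), match.end() - match.start()


# Toy sequence, made up for illustration (not taken from hg19).
toy = "ACGT" + "C" * 8 + "AT" + "G" * 6 + "A" * 5 + "T"
for start, end, base, length in homopolymer_runs(toy):
    print(f"toy\t{start}\t{end}\t{base}[{length}]")
```

With the default `min_len=6`, the run of five A's in the toy sequence is skipped, which mirrors the "length > 5" rule stated above.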

written 11 days ago by Pierre Lindenbaum96k

If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted.

written 11 days ago by WouterDeCoster20k
Alex Reynolds20k wrote:

You could download repeats for hg19 from the RepeatMasker folks and convert them to BED with convert2bed to do set operations:

$ wget -qO- http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db20140131/hg19.fa.out.gz | gunzip -c - | convert2bed --input=rmsk - > hg19.fa.out.bed

You could do ad-hoc searches with bedops, piping in your region of interest:

$ echo -e 'chr1\t981861\t981868' | bedops -e 1 hg19.fa.out.bed -

Or pass in a file of regions of interest:

$ bedops -e 1 hg19.fa.out.bed roi.bed > answer.bed

Perhaps you could use this with Pierre's binary to construct results with simple repeats and more complex repeat hits.
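
For readers unfamiliar with what the `bedops -e 1` filter does, the idea can be sketched in a few lines of Python: keep every element of the first set that overlaps some region of interest by at least one base. This is a naive illustration under stated assumptions, not how bedops is implemented (bedops works on sorted files and is far more efficient); the names `overlaps` and `element_of` and the toy intervals are hypothetical.

```python
from typing import List, Sequence, Tuple

# (chrom, start, end) with 0-based, half-open coordinates, as in BED.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least 1 base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def element_of(reference: Sequence[Interval], regions: Sequence[Interval]) -> List[Interval]:
    """Return reference intervals overlapping any region by at least 1 base.

    Naive O(len(reference) * len(regions)) scan, fine for a sketch only.
    """
    return [r for r in reference if any(overlaps(r, q) for q in regions)]


# Toy intervals, made up for illustration (not real annotations).
repeats = [("chr1", 100, 150), ("chr1", 200, 260), ("chr2", 100, 150)]
roi = [("chr1", 140, 145), ("chr1", 260, 300)]
for chrom, start, end in element_of(repeats, roi):
    print(f"{chrom}\t{start}\t{end}")
```

Note that with half-open coordinates the intervals chr1:200-260 and chr1:260-300 touch but do not overlap, so only the first repeat is reported.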

written 11 days ago by Alex Reynolds20k

Thank you, Alex, that's a great resource. I have downloaded the RepeatMasker (RM) file, and I think combining the list generated with Pierre's code with this file will give a good starting list.

However, I am still trying to understand why even this RM file is missing some simple repeats, e.g., the trinucleotide repeat chr1:6680069-6680085 [GAA]n. For those of you who understand RM well, are there criteria a simple repeat must meet to be included in the final RM list, e.g., a unit copy number >= 10 or something like that? I ask because most of the longer repeats are present in the RM file.

written 11 days ago by Jackie40

It's unclear to me what parameters were used to generate these files. The best people to ask would probably be the RepeatMasker folks.
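
Independently of how RepeatMasker filters its output, short perfect repeats such as a 17 bp (GAA)n stretch could be enumerated directly from the sequence. The Python sketch below is one possible approach, not RepeatMasker's or TRF's method: for each unit length k from 1 to 4 it looks for maximal stretches where every base equals the base k positions later, which means the stretch has period k. The function names and the minimum copy numbers (6 for mono-, 3 for di-, tri- and tetranucleotide units) are hypothetical choices for illustration, and mismatches or indels are not tolerated.

```python
from typing import Dict, List, Optional, Tuple


def is_primitive(unit: str) -> bool:
    """Return False if unit is a shorter unit repeated (e.g. "AA" or "ATAT")."""
    k = len(unit)
    return not any(k % d == 0 and unit[:d] * (k // d) == unit for d in range(1, k))


def perfect_tandem_repeats(
    seq: str, min_copies: Optional[Dict[int, int]] = None
) -> List[Tuple[int, int, str]]:
    """Return (start, end, unit) for perfect repeats, 0-based half-open.

    min_copies maps unit length to the minimum number of copies required.
    The default thresholds are hypothetical and should be tuned.
    """
    if min_copies is None:
        min_copies = {1: 6, 2: 3, 3: 3, 4: 3}
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in sorted(min_copies):
        i = 0
        while i + k < n:
            # Extend j while the base at j equals the base k positions later.
            j = i
            while j + k < n and seq[j] in "ACGT" and seq[j] == seq[j + k]:
                j += 1
            end = j + k  # seq[i:end] has period k
            unit = seq[i:i + k]
            if j > i and end - i >= min_copies[k] * k and is_primitive(unit):
                hits.append((i, end, unit))
            i = j + 1  # the next stretch cannot start before j + 1
    return sorted(hits)


# Toy sequence, made up for illustration: five copies of GAA plus a partial copy.
toy = "CT" + "GAA" * 5 + "GA" + "TC"
for start, end, unit in perfect_tandem_repeats(toy):
    print(f"toy\t{start}\t{end}\t({unit})n\t{(end - start) / len(unit):.1f} copies")
```

The `is_primitive` check keeps a run such as AAAAAAAA from being reported again as (AA)n or (AAAA)n. Combining output like this with the RepeatMasker BED file would be one way to build the starting list discussed above.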

written 11 days ago by Alex Reynolds20k