Finding Reference Sequences And Interpretation Of Data From Repeatmasker Ucsc Tables
2
9
Entering edit mode
11.3 years ago
User 9996 ▴ 830

I downloaded the RepeatMasker track for the mm9 genome from the Tables section of the UCSC genome browser. I get entries like these:

#bin    swScore    milliDiv    milliDel    milliIns    genoName    genoStart    genoEnd    genoLeft    strand    repName    repClass    repFamily    repStart    repEnd    repLeft    id
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4311567    1413    1


My question is: how can I know what is the reference sequence that this particular genomic location was aligned to? I understand that the Smith-Waterman alignment score is a result of aligning this piece of the genome to the reference, but it is the actual reference sequence that of the particular repeat that I'm trying to find. How can this be accessed?

EDIT: To elaborate on this, consider the above example where an "L1_Mur2" type repetitive element is described. I got the most recent RepBase I could find (release 20110920), and this yields:

\$ grep "L1_Mur2" RepeatMaskerLib.embl ID L1_Mur2_5end repeatmasker; DNA; ???; 970 BP. CC L1_Mur2_5end DNA DE RepbaseID: L1_Mur2_5end ID L1_Mur2_orf2 repeatmasker; DNA; ???; 4675 BP. CC L1_Mur2_orf2 DNA DE RepbaseID: L1_Mur2_orf2 ID L1_Mur2_3end repeatmasker; DNA; ???; 1463 BP. CC L1_Mur2_3end DNA DE RepbaseID: L1_Mur2_3end

Therefore, there are three entries for this type of element: a consensus for the 5' end, the 3' end, and for ORF2 of the element. This makes sense, but how can I tell from the UCSC repeatmasker line which of those was the consensus that the element in question was aligned to? I.e., how can I tell if it was a match to the 5' end, the 3' end or the ORF? These would be very different and I don't know how to tell that. Any ideas on this?

Also, are the coordinates repStart, repEnd, repLeft in the coordinate space of the reference or of the genome? It sounds to me from googling that it is the former, but in that case it seems impossible to interpret without having the reference sequence -- we don't know how long it is, for example, just by looking at this table, right?

Finally, I was hoping someone can explain what the milliDiv, milliIns, and milliDev fields are and what those units mean.

Thank you.

repeatmasker ucsc ucsc browser repeats • 6.4k views
14
Entering edit mode
11.3 years ago

Q: how can I know what is the reference sequence that this particular genomic location was aligned to?

Repeat annotations at UCSC are based on custom repeatmasker libraries, which are derived from reference sequences in the RepBase database. Repbase provides repeatmasker libraries here. Unfortunately (like so many things having to do with repeatmasker), there is not 100% openness and clarity about the exact libraries used by UCSC for repeatmasking:

UCSC has used the most current versions of the RepeatMasker software and repeat libraries available to generate these data. Note that these versions may be newer than those that are publicly available on the Internet.

So basically this says that you cannot identify the exact reference used with 100% reproducibility.

That said, it is likely that you can find a version of the reference sequence you are looking for (with the caveat that you cannot be sure that it is the exact sequence used) in the repeatmasker library from Repbase. Unfortunately (again), you cannot simply look up the reference sequence name in Repbase since there is not a 1-to-1 mapping of reference sequences in Repbase and the Repbase repeatmasker libraries

Q: are the coordinates repStart, repEnd, repLeft in the coordinate space of the reference or of the genome? ...it seems impossible to interpret...

These are coordinates of the match in the reference sequence. The reason you are having difficulty interpreting the coordinates is that repeatmasker represents reference coordinate systems differently on the + and - strands, e.g. from http://www.repeatmasker.org/webrepeatmaskerhelp.html

 1306 15.6  6.2  0.0 HSU08988  6563  6781  (22462) C  MER7A    DNA/MER2_type    (0)   336   103
12204 10.0  2.4  1.8 HSU08988  6782  7714  (21529) C  TIGGER1  DNA/MER2_type    (0)  2418  1493
279  3.0  0.0  0.0 HSU08988  7719  7751  (21492) +  (TTTTA)n Simple_repeat      1    33   (0)
1765 13.4  6.5  1.8 HSU08988  7752  8022  (21221) C  AluSx    SINE/Alu        (23)   289     1
12204 10.0  2.4  1.8 HSU08988  8023  8694  (20549) C  TIGGER1  DNA/MER2_type  (925)  1493   827
1984 11.1  0.3  0.7 HSU08988  8695  9000  (20243) C  AluSg    SINE/Alu         (5)   305     1
12204 10.0  2.4  1.8 HSU08988  9001  9695  (19548) C  TIGGER1  DNA/MER2_type (1591)   827     2
711 21.2  1.4  0.0 HSU08988  9696  9816  (19427) C  MER7A    DNA/MER2_type  (224)   122     2


As far as I understand things, on the + strand, repeatmasker hits are naturally annotated in the last 3 columns as: 1) start position of match in reference sequence; 2) end position of match in reference sequence; 3) number of bases in the reference repeat sequence remaining at the end of a match.

However, on the C (= complementary) strand, these columns have a different meaning. On the - strand, the last three columns of a repeatmasker hit mean: 1) number of bases in the complement of the reference repeat sequence remaining at the end of a match; 2) end position of match in reference sequence; 3) start position of match in reference sequence. (NB: this is extremely bad practice since it violates the assumptions of a data table)

Sadly, UCSC have parsed the repeatmasker output and loaded this into their tables directly, so the reference columns in the UCSC repeat tables also do not have a consistently interpretable meaning: http://genome.ucsc.edu/cgi-bin/hgTables?db=mm9&hgta_group=varRep&hgta_track=rmsk&hgta_table=rmsk&hgta_doSchema=describe+table+schema

UCSC are aware of this, and I posted a similar query to the helpdesk as your nearly 5 years ago but this has not been fixed: https://lists.soe.ucsc.edu/pipermail/genome/2007-February/012756.html

Q: Can someone can explain what the milliDiv, milliIns, and milliDev fields are and what those units mean.

milliDiv - Base mismatches in parts per thousand
milliDel - Bases deleted in parts per thousand
milliIns - Bases inserted in parts per thousand


These are the 2nd, 3rd and 4th columns from the repeatmasker output multiplied by 1000:

% substitutions in matching region compared to the consensus
% of bases opposite a gap in the query sequence (deleted bp)
% of bases opposite a gap in the repeat consensus (inserted bp)


This conversion is done to put percentages with one decimal unit into an integer scale.

0
Entering edit mode

Thanks so much for the great explanation - I added a follow up question on this, giving an example where it's not clear to me how to map the ucsc table to repbase. Any ideas on this??

0
Entering edit mode

Awesome reply. Everything makes sense now :)

0
Entering edit mode

Hi Casey, thanks for the explanation. Just one thing I found is that the 2nd, 3rd, and 4th columns of the output are not multiplied by 1000. It is actually 10, not 1000. Please refer Greg's comment at http://genome.soe.ucsc.narkive.com/1AQHfEqU/question-about-the-repeatmasker-track.

1
Entering edit mode
11.3 years ago

Thanks a lot Casey for this nice post. I am also interested in repeats and how they are reported and can be better analyzed.

About your initial question: "what is the reference sequence that this particular genomic location", I use bedtools ('fastaFromBed') to retrieve such intervals and I am quite please with this great toolbox. You would of course need to make first a proper bed from the scrambled coordinate raw UCSC data. Is this what you asked? Cheers, Stephane

0
Entering edit mode

Thanks for the bedtools suggestion, Stephane!