Interpreting UCSC Net Files
11.0 years ago

I want to find genomic regions conserved between species. Therefore I downloaded the mm9rn4 net file from UCSC, as advised by my professor. However, I do not understand how to interpret the net format (the UCSC description of the file format is here: http://genome.ucsc.edu/goldenPath/help/net.html). A few questions follow:

1) There are two types of "Classes": gap and fill. Does fill mean an overlapping region in the alignment?

2) The fill regions are of different lengths. For example:

fill 3000305 14924932 chr5 - 106694 16612310

This means that the region in mouse is 14924932 bp long and the region in rat is 16612310 bp long.

So even though I know that region x in mm9 "equals" region y in rn4, these regions are of different lengths. Does that mean I cannot assume that the subregion beginning 5000 base pairs after the start of region x "equals" the subregion beginning 5000 base pairs after the start of region y?

3) One field describes the "relative orientation" between target and query species. It is either + or -. What does this mean? How should a plus sign be interpreted?

Thanks for the patience.

Edit: I see the format is nested, so the above descriptions/thoughts are probably way off.

11.0 years ago
Emily 23k

Hi, I can't answer any specific UCSC file-format questions, but I can certainly help you with general alignment questions. So I don't know the answer to question 1.

2) Aligned sequences are not perfectly aligned. E.g., these two sequences are aligned:

ATCG-TCTAG
ATCGGTCTAG

There will be gaps in the alignment: places where one sequence has a few bases missing compared to the other sequence. In my example, a 9 bp sequence is aligned to a 10 bp sequence. It is perfectly fine for a 15 Mb sequence to align to a 17 Mb sequence.
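To make the point about offsets concrete, here is a minimal Python sketch (my own illustration, not anything taken from the UCSC tools) that walks the two gapped rows from the example and reports which base in one sequence lines up with a given base in the other. The function name `map_position` and the 0-based offsets are assumptions made for this example.

```python
def map_position(target_aln, query_aln, target_offset):
    """Return the 0-based query offset aligned to the 0-based target_offset.

    Both arguments are rows of the same gapped alignment ('-' marks a gap).
    Returns None when the target base sits opposite a gap in the query.
    """
    if len(target_aln) != len(query_aln):
        raise ValueError("aligned rows must have the same number of columns")
    t_pos = q_pos = 0  # ungapped offsets reached so far in each sequence
    for t_char, q_char in zip(target_aln, query_aln):
        if t_char != "-" and t_pos == target_offset:
            return q_pos if q_char != "-" else None
        if t_char != "-":
            t_pos += 1
        if q_char != "-":
            q_pos += 1
    raise IndexError("target_offset is beyond the end of the target sequence")


short_row = "ATCG-TCTAG"
long_row = "ATCGGTCTAG"
print(map_position(short_row, long_row, 3))  # before the gap: same offset
print(map_position(short_row, long_row, 4))  # after the gap: shifted by one
print(map_position(long_row, short_row, 4))  # the extra G has no partner
```

After the gap the offsets no longer agree, which is why "5000 bp into region x" need not correspond to "5000 bp into region y" in a large aligned block.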

If you want to see if more specific regions are aligned, then these files might not be where you want to look.

The Ensembl Alignments (text) view might give you more of what you need:

http://www.ensembl.org/Mus_musculus/Location/Compara_Alignments?align=615;db=core;r=5:3000305-3005305

3) Relative orientation refers to the direction of the region in one species compared to the direction in the other species. The end of the chromosome that we start numbering from is pretty well arbitrary.

Take a similar example to before. In one species the + strand sequence might be ATCGGTCTAG, whereas the + strand sequence in the other species is CTAGACCGAT. These sequences align perfectly if you flip one of them over (take its reverse complement), so in this case the relative orientation would be -. If, however, the sequence on both positive strands is ATCGGTCTAG, then the relative orientation is +.
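The "flip one of them over" step is a reverse complement, and it can be checked with a few lines of Python. This is only a sketch for exact matches between short strings; the helper names `reverse_complement` and `relative_orientation` are made up for this example, and real alignments tolerate mismatches and gaps.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def relative_orientation(target_plus, query_plus):
    """Return '+' or '-' for exactly matching sequences, else None."""
    if target_plus == query_plus:
        return "+"
    if target_plus == reverse_complement(query_plus):
        return "-"
    return None


print(reverse_complement("CTAGACCGAT"))
print(relative_orientation("ATCGGTCTAG", "CTAGACCGAT"))
print(relative_orientation("ATCGGTCTAG", "ATCGGTCTAG"))
```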

11.0 years ago
KCC ★ 4.1k

Do you need regions conserved between just two species, or do you need regions conserved between many species? Do you just need to know that a region is conserved, or do you want the precise regions in each species that align with each other?

  1. If it's just between two particular species, I would advise using the chain file for that pair of species. For instance, you can look at the positions that align between just mouse and human: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/vsHg19/ (UCSC also has alignments across all vertebrates and across all placental mammals, but a pairwise alignment like this wouldn't take any of that information into account.)

  2. If you don't actually care about the regions in the other species that align and you just want a measure of how conserved the residues in your target species are, then you can look at the phyloP or phastCons wig files: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/phyloP30way/ and http://hgdownload.cse.ucsc.edu/goldenPath/mm9/phastCons30way/ For each position in the genome that aligns, an attempt is made to score it as conserved in other closely related species. Positions that don't align are blank. So, for instance, you can pick a region you care about and generate a graph of how conserved the residues are across the region.

  3. Finally, if you care about the precise sequences that align between species, you can use the MAF files: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ I found this format awkward to work with, but I used a feature in Galaxy to convert MAF to FASTA, and that was a lot easier to manipulate.
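For option 2, here is a rough Python sketch of how the per-base scores could be summarised over a region. It assumes the wig data is in the fixedStep layout (a `fixedStep chrom=... start=... step=...` declaration followed by one score per line, with 1-based starts) and that each score covers a single base; the function name `mean_score` and the toy input are made up for illustration, and a real file would first need to be decompressed and read line by line.

```python
def mean_score(lines, chrom, start, end):
    """Mean fixedStep wiggle score over the 1-based inclusive range [start, end].

    Positions without a score (unaligned bases) are simply not counted.
    Returns None if no scored position falls inside the range.
    Assumption: any 'span' setting is ignored, i.e. one score per base.
    """
    total = 0.0
    count = 0
    cur_chrom, pos, step = None, 0, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            raise ValueError("this sketch only handles fixedStep data")
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the first word.
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            cur_chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        if cur_chrom == chrom and start <= pos <= end:
            total += float(line)
            count += 1
        pos += step
    return total / count if count else None


# Toy input in fixedStep layout (the scores are invented).
toy = [
    "fixedStep chrom=chr5 start=3000301 step=1",
    "0.1", "0.2", "0.3", "0.4",
    "fixedStep chrom=chr5 start=3000400 step=1",
    "0.9",
]
print(mean_score(toy, "chr5", 3000302, 3000303))
```

The same loop could collect the individual scores instead of averaging them if you want to plot conservation across the region.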

I don't know much about the net format, so I can't speak about that aspect. I did spend some time trying to figure it out, and these pages seemed like they might be helpful: http://genomewiki.ucsc.edu/index.php/Chains_Nets and http://www.pnas.org/content/100/20/11484.full
