Mm9 Vs Mm10 [Why Is The Information Less In New Genome Build]
10.4 years ago

Hi folks,

As I understand it, when a new genome build is annotated it should contain the old information plus new information, although some entries may be removed if they have since been updated or recognised to be something other than what was originally annotated.

The number of unique genes [RefSeq] is ~22K in the mm9 assembly and ~15.5K in mm10. Why is there such a large difference? I was planning to remap all my samples to the new assembly and use it for downstream processing. Would that be helpful, or should I wait for some time [in case there are planned updates to mm10]?

Also, the coordinates for the same gene are different [e.g. Adora1].

Thanks

P.S. From the NCBI release page:

Release notes: Major update made to the last MGSC release. All chromosome coordinates have changed. There is now some representation of the PAR regions on the X and Y chromosomes.

I think this coordinate change would have a large impact on downstream analysis when comparing mm9 and mm10 results.

genome ucsc chip-seq • 9.0k views

When you say 22K now 15.5K, what is the source of that information? UCSC, EBI, NCBI?


Sean, the source is the RefSeq table from UCSC for these two builds. I sorted the file and counted the unique genes in the name2 column.

10.4 years ago
brentp 24k

From the UCSC test server:

shows that

mm9 has Row Count: 28,661
mm10 has Row Count: 31,469


by choosing table knownCanonical and then clicking "Describe Table Schema".

You can do the same for transcripts with knownGene, which shows:

55,419 for mm9
59,121 for mm10


so it looks like, at least from UCSC, there is new data in mm10.


You are right brent!! Thanks

10.4 years ago

The coordinates change between genome builds by design, since sequence has been added, revised, and sometimes removed from the chromosomes. In general, all analyses need to be on the same genome build, so some analyses might need to be redone. In some cases the easiest thing is to redo the entire analysis; in others, a tool like the UCSC liftOver tool will be good enough.
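To make the idea of converting coordinates between builds concrete, here is a toy sketch. It is not how liftOver is implemented: the real tool reads chain files and also handles strand, intervals, and partial matches. The block table `TOY_BLOCKS` and the function `lift_position` are hypothetical names with made-up numbers, chosen only to show that a position maps through an aligned block by an offset and fails to map when it falls in sequence that has no counterpart in the new build.

```python
from bisect import bisect_right

# Hypothetical aligned blocks on one chromosome: (old_start, old_end, new_start).
# Coordinates are 0-based, half-open, same strand, sorted by old_start.
TOY_BLOCKS = [
    (0, 1000, 0),        # unchanged region
    (1000, 5000, 1200),  # 200 bp were inserted upstream in the new build
    (6000, 9000, 6200),  # old 5000-6000 has no aligned counterpart
]

def lift_position(pos, blocks=TOY_BLOCKS):
    """Map a single old-build position to the new build, or return None."""
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None  # pos lies in a gap between aligned blocks
    return new_start + (pos - old_start)

print(lift_position(1500))  # inside the second block
print(lift_position(5500))  # inside the unaligned gap
```

Positions that come back as `None` are the ones a lift-over cannot rescue; if many features of interest fall into that category, remapping the reads to the new build is the safer route.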


Thanks Sean, do you have an idea of how they do it? Is it manual, computational, or something else?


The steps are: 1) build the new genome (differs by organism) and 2) annotate the genome with features of interest. For step 2, UCSC, Ensembl, and NCBI each do their own annotation. Each feature set (transcripts, regulatory elements, miRNA, etc.) requires its own solution, so if you want to know details, the best bet is to pick your track or data set of interest and investigate. UCSC makes this easy since each track has a detailed description.


Great, Sean :)

10.4 years ago

My bad. I just pulled the files again, and the number of unique genes for mm9_refgene and mm10_refgene is 23334 and 23389 respectively, so there is a net gain of 55 genes in mm10.
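A difference in totals is only a net figure, so it can hide genes that were removed as well as genes that were added. A minimal sketch of a set comparison follows; the function name `compare_gene_sets` and the toy gene lists are illustrative assumptions, not real mm9 or mm10 content (only Adora1 comes from this thread).

```python
def compare_gene_sets(old_genes, new_genes):
    """Report which gene names were added or removed between two builds."""
    old, new = set(old_genes), set(new_genes)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "net_change": len(new) - len(old),
    }

# Toy name lists standing in for the name2 column of each build.
mm9_toy = ["Adora1", "GeneX", "GeneY"]
mm10_toy = ["Adora1", "GeneY", "GeneZ", "GeneW"]

result = compare_gene_sets(mm9_toy, mm10_toy)
print(result)
```

In the toy data the net change is +1, yet two names were added and one was removed, which is the kind of detail a simple count comparison would miss.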

The problem could be that I copied the table before it had loaded completely on the page, or that I used sort -u -k13 | wc -l when I should have used sort -u -k13,13 | wc -l. With -k13 the sort key runs from field 13 to the end of the line, whereas -k13,13 restricts it to field 13 alone.

The remaining question is whether to proceed with the new genome build or not, since the coordinates are different. As Sean said, this would affect all the previous analyses.
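To see why the two sort invocations can give different counts, here is a rough Python emulation. It assumes tab-separated fields and ignores locale-specific comparison rules, so it approximates rather than reproduces sort's behaviour. The function name `count_unique` and the three-column toy rows are hypothetical; in the toy data the gene name sits in field 2 rather than field 13.

```python
def count_unique(lines, start_field, end_field=None):
    """Count distinct sort keys, roughly like `sort -u -kSTART[,END] | wc -l`.

    Fields are 1-based and tab-separated. With end_field=None the key runs
    from start_field to the end of the line, as a bare `-kSTART` does.
    """
    keys = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        keys.add(tuple(fields[start_field - 1:end_field]))
    return len(keys)

# Toy rows: accession, gene name, a trailing status column.
rows = [
    "NM_A\tGeneA\tcmpl",
    "NM_B\tGeneA\tincmpl",
    "NM_C\tGeneB\tcmpl",
]

print(count_unique(rows, 2))     # key = field 2 through end of line
print(count_unique(rows, 2, 2))  # key = field 2 only
```

With the open-ended key, the two GeneA rows count as distinct because their trailing columns differ; restricting the key to the single field collapses them into one gene.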

Thanks guys.