Question: Mm9 Vs Mm10 [Why Is The Information Less In New Genome Build]
Sukhdeep Singh (Netherlands) wrote, 7.4 years ago:

Hi folks,

As I understand it, when a new genome build is annotated, it should contain the old information plus new information, although some entries may be removed if they are later recognised to be something other than what was originally annotated.

The number of unique genes [RefSeq] in the mm9 assembly is ~22K, but in mm10 it is ~15.5K. Why is there such a large difference? I was planning to remap all my samples to the new assembly and use it in downstream processing. Would that be helpful, or should I wait for some time [in case there are planned updates to mm10]?

Also, the coordinates for the same gene are different [e.g. Adora1].

Thanks

P.S. From the NCBI release page;

Release notes: Major update made to the last MGSC release. All chromosome coordinates have changed. There is now some representation of the PAR regions on the X and Y chromosomes.

I think this coordinate change would have a large impact on downstream analysis, particularly when comparing results between mm9 and mm10.

genome chip-seq ucsc • 6.8k views

When you say 22K now 15.5K, what is the source of that information? UCSC, EBI, NCBI?

Reply written 7.4 years ago by Sean Davis

Sean, the source is the RefSeq table from UCSC for these two builds. I sorted the file and counted the unique genes in the column named name2.

Reply written 7.4 years ago by Sukhdeep Singh
brentp (Salt Lake City, UT) wrote, 7.4 years ago:

From the UCSC test server, choosing the table knownCanonical and then clicking "Describe Table Schema" shows:

mm9 has Row Count: 28,661
mm10 has Row Count: 31,469

You can do the same for transcripts with knownGene, which shows:

55,419 for mm9
59,121 for mm10

So it looks like, at least from UCSC, there is new data in mm10.
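The same row counts can also be checked from the command line. This is a minimal sketch, assuming a mysql client is installed and that UCSC's public MySQL server (host genome-mysql.soe.ucsc.edu, user genome) is reachable. The function name ucsc_row_count is made up for illustration, and the counts returned today may differ from the ones quoted above.

```shell
#!/bin/sh
# Sketch: count the rows of a UCSC table through the public MySQL server.
# Assumptions: a mysql client is on PATH and the public server is reachable.
# ucsc_row_count is a hypothetical helper name, not a UCSC tool.
ucsc_row_count() {
    db=$1       # genome build, e.g. mm9 or mm10
    table=$2    # table name, e.g. knownCanonical or knownGene

    if ! command -v mysql >/dev/null 2>&1; then
        echo "mysql client not found on PATH" >&2
        return 127
    fi
    # -A skips auto-rehash, -N omits the column header, -e runs one statement
    mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \
        -e "SELECT COUNT(*) FROM $table" "$db"
}
```

For example, `ucsc_row_count mm10 knownCanonical` and `ucsc_row_count mm9 knownCanonical` would print one number each for comparison.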


You are right brent!! Thanks

Reply written 7.4 years ago by Sukhdeep Singh
Sean Davis (National Institutes of Health, Bethesda, MD) wrote, 7.4 years ago:

The coordinates change between genome builds by design, since sequence is added, revised, and sometimes removed from the chromosomes. In general, all analyses need to be on the same genome build, so some analyses might need to be redone. In some cases, the easiest thing to do is to redo the entire analysis. In other cases, a tool like UCSC liftOver will be good enough.
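As a rough illustration of the liftOver route, the sketch below wraps the UCSC liftOver command-line tool. It assumes the liftOver binary is on PATH and that the mm9-to-mm10 chain file (mm9ToMm10.over.chain.gz) has already been downloaded from UCSC. The function name and the BED file names are placeholders.

```shell
#!/bin/sh
# Sketch: convert BED intervals from mm9 to mm10 coordinates with UCSC liftOver.
# Assumptions: liftOver is on PATH and the chain file was downloaded beforehand.
# lift_mm9_to_mm10 is a hypothetical wrapper name.
lift_mm9_to_mm10() {
    in_bed=$1      # intervals in mm9 coordinates
    chain=$2       # chain file, e.g. mm9ToMm10.over.chain.gz
    out_bed=$3     # intervals successfully converted to mm10
    unmapped=$4    # intervals that could not be converted

    if ! command -v liftOver >/dev/null 2>&1; then
        echo "liftOver not found on PATH" >&2
        return 127
    fi
    # Argument order: oldFile map.chain newFile unMapped
    liftOver "$in_bed" "$chain" "$out_bed" "$unmapped"
}
```

A call might look like `lift_mm9_to_mm10 peaks.mm9.bed mm9ToMm10.over.chain.gz peaks.mm10.bed peaks.unmapped.bed`. It is worth checking how many intervals end up in the unmapped file before relying on the result.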


Thanks Sean, do you have an idea of how they do it? Is it manual, computational, or something else?

Reply written 7.4 years ago by Sukhdeep Singh

The steps are: 1) build the new genome (differs by organism) and 2) annotate the genome with features of interest. For step 2, UCSC, Ensembl, and NCBI each do their own annotation. Each feature set (transcripts, regulatory elements, miRNA, etc.) requires its own solution, so if you want to know details, the best bet is to pick your track or data set of interest and investigate. UCSC makes this easy since each track has a detailed description.

Reply written 7.4 years ago by Sean Davis

Great, Sean :)

Reply written 7.3 years ago by Sukhdeep Singh
Sukhdeep Singh (Netherlands) wrote, 7.4 years ago:

My bad. I just pulled the files again, and the numbers of unique genes in the mm9 and mm10 refGene tables are 23334 and 23389, so mm10 has 55 more genes. The problem could be that I copied the table before it had loaded completely on the page, or that I used sort -u -k13 | wc -l when I should have used sort -u -k13,13 | wc -l. The remaining problem is deciding whether to proceed with the new genome build, since the coordinates are different. As Sean said, this would affect all the previous analyses.
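To show why the two sort commands can give different counts, here is a small sketch on made-up data. It assumes a refGene-like layout with 16 tab-separated columns and the gene symbol (name2) in column 13. The three rows and the function names are invented for illustration. With -k13 the sort key runs from field 13 to the end of the line, so two transcripts of the same gene that differ in later columns both survive -u. With -k13,13 the key is field 13 only.

```shell
#!/bin/sh
# Toy refGene-like rows: 16 tab-separated columns, gene symbol (name2) in
# column 13. All values are made up; the first two rows share symbol Adora1
# but differ in the last column.
make_toy_table() {
    printf '585\tNM_1\tchr1\t+\t1\t9\t1\t9\t1\t1,\t9,\t0\tAdora1\tcmpl\tcmpl\t0,\n'
    printf '585\tNM_2\tchr1\t+\t1\t9\t1\t9\t2\t1,5,\t4,9,\t0\tAdora1\tcmpl\tcmpl\t0,1,\n'
    printf '585\tNM_3\tchr2\t-\t1\t9\t1\t9\t1\t1,\t9,\t0\tActb\tcmpl\tcmpl\t0,\n'
}

# Key = field 13 through end of line: rows with the same symbol but
# different trailing columns count as distinct.
count_key_to_end_of_line() {
    make_toy_table | LC_ALL=C sort -u -k13 | wc -l | tr -d ' '
}

# Key = field 13 only: one line is kept per gene symbol.
count_key_name2_only() {
    make_toy_table | LC_ALL=C sort -u -k13,13 | wc -l | tr -d ' '
}

# Alternative that avoids key syntax: extract the column first.
count_with_cut() {
    make_toy_table | cut -f13 | LC_ALL=C sort -u | wc -l | tr -d ' '
}

echo "sort -u -k13         : $(count_key_to_end_of_line)"   # 3
echo "sort -u -k13,13      : $(count_key_name2_only)"       # 2
echo "cut -f13 | sort -u   : $(count_with_cut)"             # 2
```

On real tables, the cut-based form is arguably the least error-prone, since it does not depend on how sort splits fields.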

Thanks guys.
