Question

Forum:Survey: help define Gencode and NCBI primary transcripts

5

6.0 years ago

Emily 23k

Ensembl and NCBI have been working to align the GENCODE and RefSeq reference transcripts. As part of that effort, we are also developing plans to define a primary transcript for every gene as well as a minimal set of clinically relevant transcripts. To guide that effort, we have developed a small survey to get input on how to define the primary transcript and whether this would be important to your work.

The survey should only take 10 minutes or less and you will have the opportunity to sign up for follow-up info about this project if you are interested.

https://goo.gl/forms/OjEXtYGt1pxcukqp1

refseq ensembl gencode ncbi transcript • 2.3k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 6.0 years ago by Emily 23k

0

We had ~1900 unique users on Biostars in the last hour. Surely more of you can find the time to complete the survey :-D

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Be fair, some of them work with proteins.

ADD REPLY • link 6.0 years ago by Emily 23k

0

One thing to consider is that wet lab scientists come to these tools to find sequences for their uses. It's already hard enough to reconcile the gene name reported in a paper (e.g. Hsc70) with the myriad of things with that name (e.g. the 20 or so HSPA8s) before you get to the transcripts.

For a wet lab scientist, a primary transcript could be a very nice thing to see, but depending on how it is defined, it may be misleading or incorrect. Some people tend to think of it as a case where one transcript is "the right one", the wild-type one found in their cells/animals/etc., and the remaining transcripts are special cases in the sense of "if you needed that one, you'd know". This isn't correct from a bioinformatics standpoint, but in a larger scope it makes sense.

In some sense, a primary transcript is best defined by whatever transcript has historically been used experimentally or referenced in the literature. Even if that transcript isn't the most abundant, contains some odd allele, etc., the most important information about that gene comes from these publications, and in particular from the wet lab experiments. We may think that in bioinformatics we can just pick the longest or most abundant transcript and be okay, but we often interpret the significance of our findings largely through what the literature tells us. If we cite papers that refer to transcript A to lend significance to our findings on transcript B, we're in trouble. The same goes for the wet lab biologist: if they clone in transcript B based on all the papers on transcript A, they're in trouble.

Not a new problem, but I'm wondering if identifying a primary transcript will, on average, worsen or improve this issue.

ADD REPLY • link 6.0 years ago by pld 5.1k

1

This is the logic behind considering this option. It's a bit of a Wild Wild West at the moment, with people picking the one transcript they're going to study by fairly arbitrary means, and they don't always pick the same one. If an authority has defined this, at least it will solve one problem.

Also, people ask us for primary transcripts all the time.

ADD REPLY • link 6.0 years ago by Emily 23k

0

A possibly relevant post: How to tell which transcript is the canonical transcript?

ADD REPLY • link 6.0 years ago by steve ★ 3.5k

score 5 · Answer 1 · 2018-04-10

This is one of those things where the reality and desired course of action are divergent and data service providers seem to need to choose between what people think they want versus the complex realities of science. Are life scientists the proper audience to "democratically" decide what "primary" means?

In my opinion, the term "primary" leads people to believe that a subset of the transcripts is more important than the others - they will study these more, hence becoming a self-fulfilling prophecy of 'importance'. It sets back science rather than promoting it.

I can't see the benefit of new terminology for things that are already defined. Clinically relevant, longest exons, high abundance, low abundance: we all know what these words mean. Whatever the temporary benefit of a seemingly consistent naming pattern might be, the information will start changing the next day. And then we have to deal with those changes via a new and potentially misleading term. Why not just call one set "clinically relevant (as of 2018)", the other "high abundance", etc., and let people filter by those?

The real challenges are in matching or summarizing one data release against another (or across versions), finding out what the differences between them are, and visualizing those differences easily.

What we really need are accurate transcripts and ways to annotate or filter transcripts based on observed abundances in tissues or conditions. What we need is information that helps cut down on the busy patchwork of "custom" little scripts written to figure out simple information.
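As a rough illustration of the kind of label- and abundance-based filtering described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Transcript` class, the `filter_transcripts` helper, the label names, the transcript IDs and the TPM values are all hypothetical and do not come from Ensembl, GENCODE or RefSeq.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical record: a transcript with descriptive labels and abundances."""
    transcript_id: str
    gene: str
    labels: set = field(default_factory=set)            # e.g. {"clinically_relevant"}
    tpm_by_tissue: dict = field(default_factory=dict)   # tissue name -> abundance (TPM)


def filter_transcripts(transcripts, required_labels=(), tissue=None, min_tpm=0.0):
    """Keep transcripts that carry every required label and, if a tissue is
    given, reach at least min_tpm in that tissue (missing tissue counts as 0)."""
    required = set(required_labels)
    selected = []
    for tx in transcripts:
        if not required <= tx.labels:
            continue
        if tissue is not None and tx.tpm_by_tissue.get(tissue, 0.0) < min_tpm:
            continue
        selected.append(tx)
    return selected


# Made-up example data, for illustration only.
transcripts = [
    Transcript("TX_A", "GENE1", {"clinically_relevant"}, {"liver": 50.0, "brain": 2.0}),
    Transcript("TX_B", "GENE1", {"longest"}, {"liver": 1.0, "brain": 40.0}),
    Transcript("TX_C", "GENE2", {"clinically_relevant", "longest"}, {"liver": 12.0}),
]

clinical = filter_transcripts(transcripts, required_labels=["clinically_relevant"])
brain_abundant = filter_transcripts(transcripts, tissue="brain", min_tpm=10.0)

print([tx.transcript_id for tx in clinical])        # ['TX_A', 'TX_C']
print([tx.transcript_id for tx in brain_abundant])  # ['TX_B']
```

The point of the sketch is that descriptive labels can coexist on one transcript and be combined with tissue-specific abundance, so nobody has to agree on a single "primary" choice before filtering.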

score 0 · Answer 2 · 2018-04-12

0

6.0 years ago

Emily 23k

Adding an answer to bump up. If you care about this at all, please fill in the survey. We're never going to please everybody but if you fill in the survey at least your voice will be heard.

ADD COMMENT • link 6.0 years ago by Emily 23k

score 0 · Answer 3 · 2018-04-17

0

6.0 years ago

Emily 23k

We will close this survey at midnight (BST) on Thursday. If you wish to have your say, you've got two days to do it.

ADD COMMENT • link 6.0 years ago by Emily 23k