Question

Is transcript linked with the reference genome

2.2 years ago

ManuelDB ▴ 80

I have been given a list of transcripts, and I want to create a BED file of target regions for an NGS pipeline I am building. My first question: are there different transcript IDs for the same regions depending on which reference genome we work with? If so, how can I find out whether the transcripts I have belong to hg19 or hg38?

This is what I have

NM_000090.3 COL3A1

NM_000138.4 FBN1

NM_000169.2 GLA

NM_000218.2 KCNQ1

...

transcripts NGS bed_file • 999 views

updated 2.2 years ago by supertech ▴ 180 • written 2.2 years ago by ManuelDB ▴ 80

2.2 years ago

supertech ▴ 180

This is not a direct answer to your question, but I believe it would be worth reading the RefSeq Frequently Asked Questions (FAQ).

score 2 · Accepted Answer · 2022-02-05

My first question: are there different transcript IDs for the same regions depending on which reference genome we work with?

Not necessarily. A transcript can be annotated on multiple genome assemblies (hg19 and hg38, for example). A transcript can also be updated over time to change its sequence (for example, trimming or extending UTRs), and each such update gets a new version number (the suffix after the dot). You can therefore end up in a situation where an older version of the transcript is annotated only on an older version of the genome.

Let us use NM_000090.3 as an example. The latest version of this transcript is NM_000090.4, which was created in Dec 2019. Any genome annotation released by NCBI RefSeq after Dec 2019 will include the latest version of the transcript, but annotations released before that will not. NCBI RefSeq annotated both hg19 (GRCh37) and hg38 (GRCh38) after Dec 2019, so the latest annotations of both genome assemblies (Annotation Release 105.20201022 for GRCh37 and Annotation Release 109.20211119 for GRCh38) have the transcript version NM_000090.4. The last annotations that included NM_000090.3 were 105.20190906 (for GRCh37, released in Sep 2019) and 109.20190905 (for GRCh38, released in Sep 2019).

If so, how can I find out whether the transcripts I have belong to hg19 or hg38?

As I have described with the example above, a given transcript can be annotated on multiple assemblies. So, knowing just the transcript identifier does not tell you whether it was annotated on, say, hg19 or hg38 or both. Such information is not readily available either; it will need some amount of work parsing the GFF3 files to extract the genes, transcripts and proteins included in a single annotation. Not impossible, but not trivial either.
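To give a rough idea of what that parsing could look like, here is a minimal Python sketch. It assumes the annotation GFF3 carries a `transcript_id=` attribute in column 9 on transcript features (check your own file, since this is an assumption about the file layout). The function names and the file name in the usage comment are made up for illustration.

```python
from typing import Dict, Iterable, Set


def transcript_ids_in_gff3(lines: Iterable[str]) -> Set[str]:
    """Collect versioned transcript IDs (e.g. NM_000090.4) from GFF3 lines.

    Assumption: the ID is stored as a `transcript_id=` attribute in column 9.
    """
    found: Set[str] = set()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key == "transcript_id":
                found.add(value)
    return found


def check_transcripts(wanted: Iterable[str], annotated: Set[str]) -> Dict[str, str]:
    """Report, per wanted ID, whether that exact version is in the annotation."""
    # Index annotated IDs by accession without the version suffix.
    by_accession: Dict[str, Set[str]] = {}
    for tid in annotated:
        by_accession.setdefault(tid.split(".")[0], set()).add(tid)

    report: Dict[str, str] = {}
    for tid in wanted:
        accession = tid.split(".")[0]
        if tid in annotated:
            report[tid] = "exact version annotated"
        elif accession in by_accession:
            others = ", ".join(sorted(by_accession[accession]))
            report[tid] = "other version annotated: " + others
        else:
            report[tid] = "not annotated"
    return report


# Example usage (hypothetical file name):
# import gzip
# with gzip.open("annotation.gff.gz", "rt") as fh:
#     annotated = transcript_ids_in_gff3(fh)
# print(check_transcripts(["NM_000090.3", "NM_000138.4"], annotated))
```

Running this once per annotation release (one GFF3 per assembly) would tell you which of your versioned IDs appear in which release.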

I want to create a BED file of target regions for an NGS pipeline I am building.

Going back to what started off this whole thing... you may want to ask whether there is a compelling reason not to just use the latest assembly (GRCh38, aka hg38) and its annotation (109.20211119). If for some reason you cannot use GRCh38 (hg38) and need data on the old assembly GRCh37 (hg19), you can use the latest annotation of that assembly, 105.20201022 from Oct 2020.
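Once you have picked an annotation, pulling exon coordinates for your transcripts into BED is mostly a coordinate conversion: GFF3 is 1-based and inclusive, while BED is 0-based and half-open, so only the start needs shifting by one. A minimal Python sketch follows. It assumes exon rows carry a `transcript_id=` attribute, and `exons_to_bed` is a made-up helper name, not part of any library. Also note that the sequence names in NCBI GFF3 files are typically RefSeq accessions rather than `chr`-style names, so you may need to map them to whatever naming your pipeline's reference uses.

```python
from typing import Iterable, Iterator, Set


def exons_to_bed(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield 6-column BED lines for exons of the wanted transcript IDs.

    Assumption: exon features store the versioned ID as `transcript_id=`.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Parse column 9 (key=value pairs separated by ';') into a dict.
        attrs = dict(
            field.partition("=")[::2] for field in cols[8].split(";") if "=" in field
        )
        tid = attrs.get("transcript_id")
        if tid in wanted:
            start = int(cols[3]) - 1  # GFF3 1-based start -> BED 0-based start
            end = int(cols[4])        # inclusive end equals half-open end
            # BED6: chrom, start, end, name, score (placeholder 0), strand
            yield "\t".join([cols[0], str(start), str(end), tid, "0", cols[6]])


# Example usage (hypothetical file names):
# with open("annotation.gff") as fh, open("targets.bed", "w") as out:
#     for bed_line in exons_to_bed(fh, {"NM_000090.4", "NM_000138.4"}):
#         out.write(bed_line + "\n")
```

The output is unsorted and may contain overlapping intervals if transcripts share exons, so sorting and merging afterwards is probably worthwhile.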