Question

Refseq Version in hg19 Human Reference Genome

5.8 years ago

adelemusa1 • 0

Hi.

I'm trying to retrieve transcript sequences from transcript IDs. I need the same transcript versions as those used in the experiment I'm working with, but the version numbers are missing from its data.

I know from the experiment description that raw reads were mapped to the hg19 transcriptome, which was aggregated from the UCSC RefSeq and GENCODE v12 databases.

I've tried using the GTF file for hg19, but the versions don't match. I'm working with a large dataset, so I need an easy and direct way to determine the right versions.

Thank you!

Adelaide

Refseq genome gene sequence • 2.0k views

ADD COMMENT • link 5.8 years ago by adelemusa1 • 0

Why do you not want to use the latest version of the RefSeq transcripts? There may be very minor changes but in any case the latest version reflects the best information available at the present.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Because for each transcript I have a string of scores, each score referring to a single nucleotide, so it's important for me to know the exact sequence. Moreover, different versions differ in length, which would make the positional score information useless.

ADD REPLY • link 5.8 years ago by adelemusa1 • 0

score 1 · Answer 1 · 2018-07-17

5.8 years ago

GenoMax 141k

You can use NCBI's E-utilities (not the Unix command-line utilities) to retrieve this information. For example, you can retrieve a previous version such as NM_011721.3 (the current version is NM_011721.4) by using a URL in this format: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3

If you need just the sequence, you could use: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3?report=fasta
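For a large list of IDs, the same versioned lookup can be scripted. Below is a minimal sketch in Python, assuming the E-utilities `efetch` endpoint with `db=nuccore` and `rettype=fasta`; the helper names (`build_efetch_url`, `parse_fasta`, `fetch_transcript`) are made up for illustration and are not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession_version):
    """Build an efetch URL for one versioned accession, e.g. 'NM_011721.3'."""
    params = {
        "db": "nuccore",
        "id": accession_version,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0][1:]  # drop the leading '>'
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


def fetch_transcript(accession_version):
    """Download one versioned record (needs network access)."""
    with urlopen(build_efetch_url(accession_version)) as response:
        return parse_fasta(response.read().decode("utf-8"))


if __name__ == "__main__":
    header, seq = fetch_transcript("NM_011721.3")
    print(header, len(seq))
```

When looping over many accessions, keep in mind that NCBI limits unauthenticated E-utilities clients to about three requests per second, so a short pause between calls (or an API key) is advisable.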

ADD COMMENT • link 5.8 years ago by GenoMax 141k

Thank you, I did what you suggested, but another problem came up. The lengths of some transcripts differ from those in my experimental data. I checked different versions, but none of them seems to be right. I noticed that some NCBI transcripts have poly-A tails. Is there a way to clean the sequences? I already tried the strip(A) function, but it seems that sometimes my transcripts contain some As at the end, and when these are removed as well the length doesn't match. I wrote a script that strips all trailing As and then adds back the number of As needed to match the length of my transcript (max 10); if that doesn't match, it moves on to another version. This seems to work, but I can't be sure that I'm picking the right version and that the sequence is not altered. For me it's important that the sequence is identical, nucleotide by nucleotide, to the one used in the experiment, to avoid score shifts that would invalidate the whole dataset.
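The trimming approach described above can be sketched roughly as follows. This is not the actual script, only an illustration under the stated assumptions: the expected transcript length is known, only trailing As may be removed, and at most 10 trailing As are kept. The function names are hypothetical.

```python
def trailing_a_count(seq):
    """Number of consecutive 'A's at the 3' end of seq."""
    return len(seq) - len(seq.rstrip("A"))


def trim_to_length(seq, target_len, max_kept_as=10):
    """Return seq cut to target_len if that only removes trailing 'A's.

    Returns None when the target length cannot be reached by trimming
    the poly-A tail alone, or when more than max_kept_as trailing 'A's
    would have to be kept (assumed limit, taken from the post above).
    """
    seq = seq.upper()
    core_len = len(seq.rstrip("A"))  # length without any trailing 'A's
    if target_len > len(seq) or target_len < core_len:
        return None
    if target_len - core_len > max_kept_as:
        return None
    return seq[:target_len]


if __name__ == "__main__":
    candidate = "ACGTACGT" + "A" * 25  # toy sequence with a poly-A tail
    print(trim_to_length(candidate, 12))
```

A length match obtained this way only shows that a version is compatible with the expected length; it does not prove that every nucleotide is the same as in the experiment.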

ADD REPLY • link 5.8 years ago by adelemusa1 • 0

The length of some transcripts differ from the ones of my experimental data.

Since the information at NCBI is archival (records are not changed, which is why there are version numbers), all bets are probably off, especially if things don't match and you need to start modifying NCBI data so that they do. It sounds like you did not do the analysis that produced the results you are working with. Do you have a way to find out how the old results were analyzed? Are you able to repeat the analysis (probably very painful) with current/traceable information? You seem to have a special use case anyway.

ADD REPLY • link 5.8 years ago by GenoMax 141k