Question: Refseq Version in hg19 Human Reference Genome
adelemusa1 wrote:

Hi.

I'm trying to retrieve transcript sequences from transcript IDs. I need the same transcript versions as the ones used in the experiment I'm looking at, but the version numbers are missing from its IDs.

I know from the experiment description that the raw reads were mapped to an hg19 transcriptome, which was aggregated from the UCSC RefSeq and GENCODE v12 databases.

I've tried to use the GTF file from hg19, but the versions don't match. I'm working on a large dataset, so I need an easy and direct way to determine the right versions.

Thank you!

Adelaide

refseq sequence gene genome • 396 views
modified 8 months ago • written 8 months ago by adelemusa1

Why do you not want to use the latest version of the RefSeq transcripts? There may be very minor changes, but in any case the latest version reflects the best information currently available.

modified 8 months ago • written 8 months ago by genomax

Because for each transcript I have a string of scores, with each score referring to a single nucleotide, so it's important for me to know the correct sequence. Moreover, different versions differ in length, which would make the positional score information useless.

written 8 months ago by adelemusa1

genomax wrote:

You can use NCBI's E-utilities (not the Unix command-line utilities) to retrieve this information. For example, you can retrieve a previous version such as NM_011721.3 (the current version is NM_011721.4) with a URL in this format: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3

If you need just the sequence, you could use: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3?report=fasta
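For a large dataset, the same lookup can be scripted. The following is a minimal sketch, assuming the E-utilities `efetch` endpoint accepts a versioned accession (accession.version) as the `id` for the `nuccore` database and returns that exact version as FASTA. The function names `efetch_url`, `parse_fasta`, and `fetch_versioned` are illustrative, not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# E-utilities efetch endpoint (assumed reachable from your network).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession_version):
    """Build an efetch URL for one versioned accession, e.g. 'NM_011721.3'."""
    params = {
        "db": "nuccore",
        "id": accession_version,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    header = lines[0][1:]
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


def fetch_versioned(accession_version):
    """Download one versioned record and return (header, sequence)."""
    with urlopen(efetch_url(accession_version)) as response:
        return parse_fasta(response.read().decode())
```

Calling `fetch_versioned("NM_011721.3")` should then return the header and sequence of that specific version. When looping over many IDs, pause between requests so that you stay within NCBI's usage limits.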

modified 8 months ago • written 8 months ago by genomax

Thank you, I did what you suggested, but another problem came up. The lengths of some transcripts differ from those in my experimental data. I checked different versions, but none of them seems to be right. I noticed that some NCBI transcripts have poly-A tails. Is there a way to clean the sequences? I already tried a strip(A) function, but sometimes my transcripts contain some As at the end, and when these are removed as well the length doesn't match. I wrote a script that strips all trailing As and then adds back the number of As needed to match the length of my transcript (max 10); if this doesn't match, it moves on to another version. This seems to work, but I can't be sure that I'm picking the right version and that the sequence is not altered. For me it's important that the sequence is identical, nucleotide for nucleotide, to the one used in the experiment, to avoid score shifts that would invalidate the whole dataset.
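The matching procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: sequences are uppercase, the poly-A tail is a run of trailing `A`s, and the names `fit_to_length` and `pick_versions` are hypothetical. Unlike a plain strip-and-pad, it only restores `A`s that were actually present in the NCBI record, and it returns every version that fits, so that ambiguous cases stay visible instead of being resolved silently.

```python
def fit_to_length(seq, target_len, max_tail=10):
    """Trim trailing As, then restore just enough of them to reach target_len.

    Returns the fitted sequence, or None if target_len cannot be reached by
    keeping between 0 and max_tail of the As already present in seq.
    """
    core = seq.rstrip("A")            # sequence without its trailing A run
    tail = len(seq) - len(core)       # number of trailing As in the record
    missing = target_len - len(core)  # As that must be kept to match
    if 0 <= missing <= min(tail, max_tail):
        return core + "A" * missing
    return None


def pick_versions(candidates, target_len, max_tail=10):
    """Map each version whose sequence can be fitted to its fitted sequence.

    candidates: dict of {versioned accession: sequence}.
    More than one key in the result means length alone cannot decide.
    """
    matches = {}
    for version, seq in candidates.items():
        fitted = fit_to_length(seq, target_len, max_tail)
        if fitted is not None:
            matches[version] = fitted
    return matches
```

A length match alone does not show that the nucleotides are identical to those used in the experiment. If `pick_versions` returns several versions, or the original transcriptome FASTA is unavailable for comparison, the choice remains a guess.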

written 8 months ago by adelemusa1

The lengths of some transcripts differ from those in my experimental data.

Since the records at NCBI are archival (a given version never changes, which is why version numbers exist), all bets are probably off if nothing matches, especially if you have to start modifying NCBI data to make it fit. It sounds like you did not do the original analysis behind the results you are working with. Do you have a way to find out how the old results were generated? Are you able to repeat the analysis (probably very painful) with current, traceable information? You seem to have a special use case anyway.

modified 8 months ago • written 8 months ago by genomax