Question: How To Write A Script To Check Whether A Given Transcript ID Is The Latest Version
8.7 years ago by
jessada130 wrote:

My data from VariBench includes transcript IDs along with their versions, and I want to pick only the records whose transcript ID is at the latest version. Is there any place where I can download a transcript database, or any online web service I can query?

ncbi transcript snp • 2.2k views
8.7 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

Those look like RefSeq transcript IDs. You can download the current version of RefSeq here: RefSeq vertebrate mammalian. The *.rna.gbff.gz files in this directory contain a GenBank record for each RefSeq ID, and each record specifies the latest version. You would just need to grab the 'ACCESSION' and 'VERSION' values for each record. For example:

ACCESSION   XM_002714324
VERSION     XM_002714324.1  GI:291395911
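
As a rough illustration of the ACCESSION/VERSION approach, here is a minimal Python sketch. It assumes each record has a line starting with VERSION followed by accession.version, as in the example above. The function names (latest_versions, load_latest, is_latest) are made up for this example and are not part of any RefSeq tooling.

```python
import gzip
import re
from typing import Dict, Iterable

# Matches lines such as: "VERSION     XM_002714324.1  GI:291395911"
VERSION_RE = re.compile(r"^VERSION\s+([^\s.]+)\.(\d+)\b")


def latest_versions(lines: Iterable[str]) -> Dict[str, int]:
    """Map each accession to the highest version seen on a VERSION line."""
    latest: Dict[str, int] = {}
    for line in lines:
        match = VERSION_RE.match(line)
        if match:
            accession, version = match.group(1), int(match.group(2))
            latest[accession] = max(version, latest.get(accession, 0))
    return latest


def load_latest(path: str) -> Dict[str, int]:
    """Read one gzipped *.rna.gbff.gz file and collect its versions."""
    with gzip.open(path, "rt") as handle:
        return latest_versions(handle)


def is_latest(transcript_id: str, latest: Dict[str, int]) -> bool:
    """True only if the ID carries a version equal to the recorded one."""
    accession, _, version = transcript_id.partition(".")
    return version.isdigit() and latest.get(accession) == int(version)
```

With a table built from the downloaded files (for instance by merging the result of load_latest over every *.rna.gbff.gz file), is_latest("XM_002714324.1", table) would be true as long as version 1 is what the file reports. IDs that are missing from the table are treated as not latest here, which may or may not be what you want.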

Another option would be to use the NCBI E-utilities. For example, use esearch to get the uid for each RefSeq ID, then query again with that uid to get the RefSeq ID with its current version number.

The following returns an XML for 'NM_000014' (note that no version is specified here) containing the uid '66932946':

The following returns an XML for the uid '66932946':

This XML contains a line: gi|66932946|ref|NM_000014.4|[66932946]

This tells you that this RefSeq transcript is currently at version 4. Of course, you would need a script to automate this process for the number of records that you have.
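
A hedged sketch of how that automation might look in Python follows. Note that it swaps in a different technique from the two-step lookup above: it assumes that a single efetch request with db=nuccore, rettype=acc and retmode=text, given an unversioned accession, returns the current accession.version as plain text. The helper names (efetch_url, parse_version, current_version) are hypothetical, and the parser is written so that it also works on a line like the gi|...|ref|NM_000014.4| one shown above.

```python
import re
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint: the public NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str) -> str:
    """Build an efetch URL asking for the accession.version as text."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "acc", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def parse_version(text: str, accession: str) -> Optional[int]:
    """Find '<accession>.<N>' anywhere in text and return N, else None."""
    match = re.search(re.escape(accession) + r"\.(\d+)", text)
    return int(match.group(1)) if match else None


def current_version(accession: str) -> Optional[int]:
    """Ask NCBI for the current version (needs network access)."""
    with urllib.request.urlopen(efetch_url(accession)) as response:
        return parse_version(response.read().decode(), accession)
```

For many records you would loop over your accessions, compare current_version(accession) with the version in your data, and pause between requests so that you stay within NCBI's usage limits.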


Can you explain more about the directory structure of the files in RefSeq vertebrate mammalian? I saw about 100 sets of files there, with around 6-10 files in each set. If I only need human mRNA transcripts, which groups of files should I download? FYI, I'm really new to biology but have a very strong background in computer science.

written 8.7 years ago by jessada130

There are 6 data sets in the RefSeq FTP directory, each in a specific file format. Each of these 6 data sets is split into 144 blocks to keep individual files from getting too large. This is sort of explained here: The six file types are: genomic.fna (genome data in FASTA nucleic acid format), genomic.gbff (genome data in GenBank flat file format), protein.faa (protein data as FASTA amino acid), protein.gpff (protein data as GenPept flat file), rna.fna (RNA data as FASTA nucleic acid), rna.gbff (RNA data as GenBank flat file).

written 8.7 years ago by Malachi Griffith18k

If you go the RefSeq FTP route, it might be more convenient to work with the human-specific files here:

written 8.7 years ago by Malachi Griffith18k