longest transcript NCBI NM_ids
3
0
9.1 years ago
User6891 ▴ 330

Hi,

Is there a way to retrieve the longest transcripts for a list of 4000 genes? I want to use the NM_transcript ids from NCBI. I just need the NM_id for the longest transcript.

ucsc ncbi transcript gene • 5.7k views
3

Start by breaking down the problem into sub-problems, like so:

  1. What is the available API that will take an identifier as input and give you the most minimal output from which length can be calculated or extracted? (Entrez E-utilities?)
  2. What is the optimal input you can provide? (Only gene names? Gene symbols? GI numbers?)
  3. For a given output, how can you process the output to extract/calculate the length?
  4. How can you automate this in a loop?

Solve these, and you've figured out how to automate pretty much any querying using NCBI.
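As a rough illustration of steps 1, 2 and 4, here is a minimal sketch that only builds the request URLs for NCBI's E-utilities (ESearch and ESummary) and splits an ID list into batches. The search term, the choice of the `nuccore` database, and the helper names `esearch_url`, `esummary_url` and `batches` are assumptions made for this example, not something taken from the thread.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(gene_symbol, organism="Homo sapiens", retmax=200):
    """Build an ESearch URL listing RefSeq mRNA records for one gene symbol.

    The term below is an assumed query; adjust it to your organism and needs.
    """
    term = (
        f"{gene_symbol}[Gene Name] AND {organism}[Organism] "
        "AND biomol_mrna[PROP] AND srcdb_refseq[PROP]"
    )
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(uids):
    """Build an ESummary URL for a batch of nuccore UIDs."""
    params = {"db": "nuccore", "id": ",".join(uids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Fetching each URL (for example with `urllib.request.urlopen`) and reading the sequence length out of the summary records is step 3. The length field should be present in the ESummary output, but check its exact name in a real response before relying on it. NCBI also limits request rates, so a loop over 4000 genes needs a short pause between calls.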

Or, explore UCSC Genome Browser's underlying MySQL tables and check if SQL can help you make your job easier.

1

What Ram said, though in this particular instance the UCSC database might be annoying to use, since you have to calculate the transcript lengths yourself (the most useful coordinates are the exon start/stop positions, and those are stored as comma-separated lists rather than as separate entries). You might have better luck using BioMart. You'll have to process the query output, but it's simple enough to get the lengths of a large number of transcripts.
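For the UCSC route, the length calculation itself is short once the comma-separated fields are split. This is a minimal sketch, assuming the coordinates come from a UCSC gene table such as `refGene`, whose `exonStarts` and `exonEnds` columns hold comma-separated lists with a trailing comma, 0-based starts and exclusive ends. The function name `transcript_length` is made up for the example.

```python
def transcript_length(exon_starts, exon_ends):
    """Sum exon sizes from UCSC-style comma-separated coordinate strings.

    Assumes 0-based starts and exclusive ends, so each exon is end - start.
    Empty items (from the trailing comma) are skipped.
    """
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different numbers of entries")
    return sum(end - start for start, end in zip(starts, ends))


# Two exons of 100 and 150 bases
print(transcript_length("100,300,", "200,450,"))  # 250
```

If the values arrive as bytes from a MySQL client, decode them to `str` first. This gives the spliced transcript length (UTRs plus CDS), which is not the same as `txEnd - txStart`, since the latter includes introns.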

0

I will use the BioMart tool to get the transcript start and stop positions, together with the gene names and the NM_ IDs. I will then use my own script to calculate the lengths and keep only the longest transcript per gene.
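The "keep only the longest" step could look roughly like the sketch below. It assumes a tab-separated export with a header line and three columns in the order gene name, NM_ ID, length. The column order, the function names, and the IDs in the sample data are all hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Read (gene, nm_id, length) tuples from a tab-separated file with a header.

    Rows without an NM_ ID or with fewer than three columns are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    return [(r[0], r[1], int(r[2])) for r in reader if len(r) >= 3 and r[1]]


def longest_per_gene(rows):
    """Map each gene to its (nm_id, length) pair with the greatest length.

    On ties, the transcript seen first is kept.
    """
    best = {}
    for gene, nm_id, length in rows:
        if gene not in best or length > best[gene][1]:
            best[gene] = (nm_id, length)
    return best


# Made-up sample data; these are placeholder IDs, not real RefSeq records.
sample = io.StringIO(
    "gene\trefseq\tlength\n"
    "GENE_A\tNM_000001\t1200\n"
    "GENE_A\tNM_000002\t3400\n"
    "GENE_B\tNM_000003\t800\n"
)
print(longest_per_gene(read_rows(sample)))
```

Note that a length computed as transcript end minus start is the genomic span including introns. If the spliced length is what matters, use an exon-based length or a transcript-length attribute instead.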

1
9.1 years ago
Mohamed ▴ 70

Well, I am no expert on this, so until someone answers you properly, here is my suggestion. Go to NCBI, select Nucleotide, type anything, and press Enter to activate the nucleotide database search, or open the Advanced options. Then select sequence length and enter 70000 to 999999. (Of course, you have to add other options to restrict the search to mRNA.) I got all of the above from the following website, where I learned this trick for searching for the longest (or shortest) DNA/mRNA/protein sequences:

http://wiki.bits.vib.be/index.php/Exercises_on_Genbank

(look under Exercise 3).

Regards,
Mohamed

1

I'm sorry, but how is this relevant? The length filter makes little sense here. Also, the underlying problem is bulk querying efficiently, not finding relevant data with a manual search.

0
9.1 years ago
User6891 ▴ 330

So this information is not easily retrievable from NCBI or UCSC?

0

It definitely should be retrievable with a bit of manual effort, for 1 record. For 4000 records, it becomes a problem of scale.

0
9.1 years ago

I think it's possible in Ensembl via BioMart. In Filters, select the RefSeq ID filter and paste your ID list into it. In Attributes, select transcript length.

It should be easy to do through their API.
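A minimal sketch of what such an API request might look like, building the BioMart XML query as a string. The dataset name `hsapiens_gene_ensembl`, the filter `refseq_mrna`, and the attributes `external_gene_name`, `refseq_mrna` and `transcript_length` are my assumptions about the current Ensembl mart and should be checked against the BioMart web interface. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed service endpoint for the Ensembl BioMart
MART_URL = "https://www.ensembl.org/biomart/martservice"


def biomart_query_xml(refseq_ids):
    """Return a BioMart XML query asking for gene name, RefSeq ID and length."""
    values = quoteattr(",".join(refseq_ids))  # quoted, XML-escaped attribute value
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="refseq_mrna" value={values}/>'
        '<Attribute name="external_gene_name"/>'
        '<Attribute name="refseq_mrna"/>'
        '<Attribute name="transcript_length"/>'
        "</Dataset></Query>"
    )


def biomart_request(refseq_ids):
    """Return (url, form-encoded POST body) without sending anything."""
    body = urlencode({"query": biomart_query_xml(refseq_ids)}).encode("ascii")
    return MART_URL, body
```

The pair returned by `biomart_request` can be passed to `urllib.request.urlopen(url, data=body)`. The response should be tab-separated text that can then be reduced to one transcript per gene. The RefSeq filter may expect IDs without a version suffix, so strip any `.N` first if the query returns nothing.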

0

Hi NicoBxl,

I am working on a project similar to the one discussed above, and I feel that I am very close to being able to do this by following your directions. However, I cannot find "transcript length" in Attributes, and the query also does not seem to return multiple RefSeq IDs for genes that have many (e.g. DMD). Any suggestions?

Thank you!

Renee Bend
