Retrieve gene length and CDS locus information
5.7 years ago
MAPK ★ 2.1k

I have been trying to retrieve the CDS start and end positions and the gene length for about 1000 protein accessions (e.g., ALD89117.1, ALD89128.1, ALD89126.1, ANR02692.1, AVA17449.1) that I have as input. Would someone be kind enough to share code or expertise on how to get this done? Thanks.

ncbi batch_entrez • 1.1k views

@Sej provided an answer this morning: C: download genbank sequences with exon sequences highlighted

Modify as necessary. She is referring to the NCBI Unix utils.

5.7 years ago
GenoMax 141k
$ esearch -db protein -query "AVA17449.1" | elink -target nuccore | efetch -format ft
>Feature gb|MG256173.1|
473 2857    CDS
            product RNA-dependent RNA polymerase
            transl_table    4
            protein_id  gb|AVA17449.1|

$ esearch -db protein -query "ALD89117.1" | elink -target nuccore | efetch -format ft
>Feature gb|KP900907.1|
2812    572 CDS
            product RNA-dependent RNA polymerase
            transl_table    4
            protein_id  gb|ALD89117.1|
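
To turn output like the above into one row per protein, the feature table can be parsed with awk. Note that in the second record the start (2812) is larger than the end (572), which in this format means the CDS lies on the minus strand. The following is only a sketch: `ft_to_cds` is a hypothetical helper name, not part of the NCBI tools, and it assumes each CDS is a single interval, so the reported length is the CDS span in nucleotides rather than a full gene length.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDirect): reads feature-table text
# (the output of `efetch -format ft`) on stdin and prints one tab-separated
# line per CDS: protein_id, nucleotide accession, start, end, strand, length.
# Assumption: each CDS is a single interval; a joined (multi-interval) CDS
# would be reported by its first interval only.
ft_to_cds() {
  awk -v OFS='\t' '
    # Header line, e.g. ">Feature gb|MG256173.1|": keep the accession.
    /^>Feature/ { split($2, a, "|"); nuc = a[2]; in_cds = 0; next }

    # Feature line: two coordinates followed by a feature key.
    NF >= 3 && $1 ~ /^[<>]?[0-9]+$/ && $2 ~ /^[<>]?[0-9]+$/ {
      in_cds = ($3 == "CDS")
      if (in_cds) {
        s = $1; e = $2
        gsub(/[<>]/, "", s); gsub(/[<>]/, "", e)   # drop partial markers
        s += 0; e += 0
        strand = (s <= e) ? "+" : "-"              # start > end: minus strand
        lo = (s <= e) ? s : e
        hi = (s <= e) ? e : s
      }
      next
    }

    # Qualifier line inside a CDS, e.g. "protein_id gb|AVA17449.1|".
    in_cds && $1 == "protein_id" {
      n = split($2, p, "|")
      id = (n > 1) ? p[2] : $2
      print id, nuc, lo, hi, strand, hi - lo + 1
      in_cds = 0
    }
  '
}
```

Appending `| ft_to_cds` to either command above should give, for the records shown, `AVA17449.1 MG256173.1 473 2857 + 2385` and `ALD89117.1 KP900907.1 572 2812 - 2241` (tab-separated). For the full list of about 1000 accessions, the same pipeline could be run inside a `while read` loop over a file of accessions. If a linked nucleotide record carries several CDS features, every one of them is printed, so the result may need filtering against the input accessions.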
5.7 years ago
h.mon 35k

You don't tell us what those identifiers are; knowing that would make it easier to give advice.

There are plenty of solutions around. For example, you can use the GenomicRanges and "org.eg.db" packages; if you have a GTF, you can follow Tutorial: Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor; and you can use biomaRt as well (see this comment for hints). Finally, although Obtaining Exon Lengths is really old, I think the answers still work.


Thanks, I have edited the question with accessions.
