Question: How to fetch the nucleotide sequence of a specific CDS from the Linux terminal using a GenBank accession number
arriyaz.nstu wrote (10 weeks ago):

Let's say I have two GenBank accession numbers for a virus, AP017458 and GQ994935. I want to download the nucleotide sequence (CDS) of ORF57 for these accessions from the Linux terminal. As far as I know, command-line tools such as efetch and esearch can download sequences directly from the terminal, but I need only the coding sequence of a specific ORF.

How can I download the coding sequence of a specific ORF using accession numbers (all together or, at least, one accession per run)?

Tags: sequence, gene
genomax wrote (10 weeks ago):

Using Entrez Direct (sequence output truncated for space):

$ esearch -db nuccore -query "AP017458" | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na' 

>lcl|AP017458.1_cds_BAV17910.1_1 [protein=ORF57] [protein_id=BAV17910.1] [location=join(82084..82131,82240..83559)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA
GCCTCCGGCAGACGAGCGCATGCGTGGTACCCAGTCGGCCCAGGGAATCCCACCCCCCCTGGGCCGCATC
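
To make the awk step in the command above easier to follow, here is a small offline sketch. It assumes, as the pipeline does, that the Extra field of a protein DocumentSummary is pipe-delimited with the protein accession in the fourth field. The sample value is illustrative only: the GI number in it is made up, and only the accession BAV17910.1 comes from the output above.

```shell
# Hypothetical Extra value; the GI number is invented for illustration.
extra='gi|0000000|dbj|BAV17910.1|'

# Split on "|" and keep field 4, the protein accession.version
# that is later passed to efetch -db protein -id.
protein_id=$(printf '%s\n' "$extra" | awk -F '|' '{print $4}')
echo "$protein_id"
```

If the Extra field is laid out differently for some record, the fourth field would not be the accession, so it is worth printing the xtract output once before piping it into efetch.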

The other accession returns two entries:

$ esearch -db nuccore -query "GQ994935" | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na' 

>lcl|GQ994935.1_cds_ACY00456.1_1 [protein=ORF57] [protein_id=ACY00456.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA

>lcl|KF588566.1_cds_AKE33094.1_1 [gene=ORF57] [protein=ORF57] [protein_id=AKE33094.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA
GCCTCCGGCAGACGAGCGCATCCGTGGTACCCAGTCGGCCCAGGGAATCCCACCCCCCCTGGGCCGCATC
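
The GQ994935 query returns a second record whose header names KF588566.1. If only records annotated on the queried accession are wanted, one possible approach is to filter the FASTA by header. This is a sketch: `keep_accession` and `demo_cds.fasta` are hypothetical names, and the demo file holds just the two headers and the first sequence line shown above.

```shell
# Two abbreviated records, using the headers shown in the output above.
cat > demo_cds.fasta <<'EOF'
>lcl|GQ994935.1_cds_ACY00456.1_1 [protein=ORF57] [protein_id=ACY00456.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
>lcl|KF588566.1_cds_AKE33094.1_1 [gene=ORF57] [protein=ORF57] [protein_id=AKE33094.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
EOF

# keep_accession ACCESSION FILE (hypothetical helper):
# print only the records whose header starts with ">lcl|ACCESSION.".
keep_accession() {
  awk -v acc="$1" '
    /^>/ { keep = (index($0, ">lcl|" acc ".") == 1) }  # decide per header
    keep                                               # print header and its sequence lines
  ' "$2"
}

keep_accession GQ994935 demo_cds.fasta
```

The filter relies on the `lcl|<accession>.<version>_cds_...` header layout seen in the output above; headers in another style would need a different match.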

Hi, your code is just amazing. Thank you very much. Is it possible to download the CDS for more than one accession in a single command? I have a very long list of accession numbers, and I have to download the ORF57 CDS for all of them.

written 10 weeks ago by arriyaz.nstu

Absolutely. Use the epost method:

 $ epost -db nuccore -input id_file | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na'

id_file should contain one accession per line.
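
As a minimal sketch of the batch setup, the following writes the two accessions from the question into id_file and wraps the same pipeline in a function so the ORF name becomes a parameter. The function name `fetch_orf_cds` and the output file name are hypothetical; the call itself is left commented out because it needs Entrez Direct installed and network access.

```shell
# One accession per line, as epost -input expects here.
printf '%s\n' AP017458 GQ994935 > id_file

# fetch_orf_cds ID_FILE ORF_NAME (hypothetical wrapper around the pipeline above).
fetch_orf_cds() {
  ids="$1"
  orf="$2"
  epost -db nuccore -input "$ids" |
    elink -target protein |
    esummary |
    xtract -pattern DocumentSummary -if Title -contains "$orf" -element Extra |
    awk -F '|' '{print $4}' |
    xargs -n 1 sh -c 'efetch -db protein -id "$0" -format fasta_cds_na'
}

# Example call (requires Entrez Direct and network access):
# fetch_orf_cds id_file ORF57 > orf57_cds.fasta
```

Redirecting the output to a file keeps all retrieved CDS records together in one multi-FASTA.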

written 10 weeks ago by genomax

You made my day. Thank you very much.

written 10 weeks ago by arriyaz.nstu

Please accept the answer (green check mark) then.


written 10 weeks ago by genomax