How to fetch the nucleotide sequence of a specific CDS from the Linux terminal using a GenBank accession number
3.9 years ago
arriyaz.nstu ▴ 30

Let's say I have two GenBank accession numbers, AP017458 and GQ994935, for a virus. I want to download the nucleotide sequence (CDS) of ORF57 for these accessions from the Linux terminal. As far as I know, command-line tools such as efetch and esearch can download sequences directly from the terminal, but I need only the coding sequence of a specific ORF.

How can I download the coding sequence of a specific ORF using accession numbers (all together or, at least, one accession per run)?

gene sequence • 2.0k views
3.9 years ago
GenoMax 141k

Using Entrez Direct (sequence output truncated for space):

$ esearch -db nuccore -query "AP017458" | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na' 

>lcl|AP017458.1_cds_BAV17910.1_1 [protein=ORF57] [protein_id=BAV17910.1] [location=join(82084..82131,82240..83559)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA
GCCTCCGGCAGACGAGCGCATGCGTGGTACCCAGTCGGCCCAGGGAATCCCACCCCCCCTGGGCCGCATC
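
To see what the awk step in that pipeline is doing: xtract prints the Extra field of each matching protein summary, and the fourth "|"-separated field is passed on to efetch as the protein accession. A minimal offline sketch, assuming the field has the form `gi|<number>|dbj|BAV17910.1|`; the GI number below is a placeholder rather than a real identifier, and the helper name `extract_protein_acc` is made up for illustration:

```shell
# Hypothetical helper: print the 4th "|"-separated field of each input line.
extract_protein_acc() {
    awk -F "|" '{print $4}'
}

# Placeholder Extra value; only the accession BAV17910.1 is taken from the
# output shown above, the GI number is invented.
printf '%s\n' 'gi|0000000000|dbj|BAV17910.1|' | extract_protein_acc
# prints BAV17910.1
```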

There are two entries for the other accession; note that the second one is reported under a different nucleotide record, KF588566.1:

$ esearch -db nuccore -query "GQ994935" | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na' 

>lcl|GQ994935.1_cds_ACY00456.1_1 [protein=ORF57] [protein_id=ACY00456.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA

>lcl|KF588566.1_cds_AKE33094.1_1 [gene=ORF57] [protein=ORF57] [protein_id=AKE33094.1] [location=join(81886..81934,82043..83361)] [gbkey=CDS]
ATGGTACAAGCAATGATAGACATGGACATTATGAAGGGCATCCTAGAGGACTCTGTGTCCTCCTCTGAGT
TTGACGAATCGAGGGACGACGAGACGGACGCACCGACACTGGAAGACGAGCAATTGTCCGAACCCGCCGA
GCCTCCGGCAGACGAGCGCATCCGTGGTACCCAGTCGGCCCAGGGAATCCCACCCCCCCTGGGCCGCATC
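
The last record above is reported under KF588566.1 rather than GQ994935.1. If only CDS records from the queried nucleotide accession are wanted, the FASTA output could be filtered on the header prefix. A rough sketch, assuming headers always start with `>lcl|<accession>.<version>_cds_` as in the output shown; the function name `keep_cds_from` is hypothetical:

```shell
# Hypothetical helper: read FASTA on stdin and keep only records whose
# header starts with ">lcl|<accession>." (accession given as $1).
keep_cds_from() {
    awk -v acc="$1" '
        # On each header line, decide whether this record is kept.
        /^>/ { keep = (index($0, ">lcl|" acc ".") == 1) }
        # Print every line (header or sequence) of a kept record.
        keep
    '
}

# Example: append the filter to the end of the pipeline, e.g.
#   ... | xargs -n 1 sh -c '...' | keep_cds_from GQ994935
```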

Hi, your code is just amazing. Thank you very much. Is it possible to download the CDS for more than one accession in a single command? I have a very long list of accession numbers and I have to download the CDS of ORF57 for all of them.


Absolutely. Use the epost method:

$ epost -db nuccore -input id_file | elink -target protein | esummary | xtract -pattern DocumentSummary -if Title -contains "ORF57" -element Extra | awk -F "|" '{print $4}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta_cds_na'

id_file should contain one accession per line.
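
For the "one accession per run" variant mentioned in the question, the single-accession pipeline could also be wrapped in a loop over id_file. This is only a sketch, assuming Entrez Direct is installed and on the PATH; the function names `fetch_orf_cds` and `fetch_all` and the output file name are hypothetical, and the pipeline inside is the one from the answer above:

```shell
# Hypothetical wrapper around the single-accession pipeline from the answer.
# $1 = nucleotide accession, $2 = text to match in the protein title (e.g. ORF57)
fetch_orf_cds() {
    esearch -db nuccore -query "$1" |
        elink -target protein |
        esummary |
        xtract -pattern DocumentSummary -if Title -contains "$2" -element Extra |
        awk -F "|" '{print $4}' |
        xargs -n 1 sh -c 'efetch -db protein -id "$0" -format fasta_cds_na'
}

# Loop over a file with one accession per line, skipping blank lines.
# $1 = id file, $2 = ORF name
fetch_all() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue
        # stdin is redirected so the pipeline cannot swallow the
        # remaining lines of the id file.
        fetch_orf_cds "$acc" "$2" < /dev/null
    done < "$1"
}

# Example (needs network access):
#   fetch_all id_file ORF57 > orf57_cds.fasta
```

Running one accession at a time is slower than epost, but it makes it easier to see which accession a failure or an empty result belongs to.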


You made my day. Thank you very much.


Please accept the answer (green check mark) then.
