unexpected output from efetch
1
0
Entering edit mode
2.2 years ago

Sorry for the bad title.

I am using efetch to programmatically retrieve protein sequences by using -id -chr_start and -chr_end. I am using efetch in loop but I noticed that I was getting more sequences than expected:

$ head coordinates

NZ_AJFJ01000034.1   85694   87193
NZ_AJFJ01000034.1   88084   89073
NZ_CP027541.1   2096426 2098054
NZ_SITV01000020.1   28544   30172
NZ_CP009496.1   2052819 2054447
NZ_SITZ01000020.1   110428  112056
NZ_SITQ01000020.1   110832  112460
NZ_SITX01000020.1   110428  112056

$ cat coordinates | while IFS=$'\t' read acn str stp
do
    echo $acn $str $stp
    efetch -db nuccore -format fasta_cds_aa -id "$acn" -chr_start "$str" -chr_stop "$stp" > $acn$str$stp.faa
done;

The coordinates I am using should give me back only one protein per line. However, for some set of coordinates I get two protein instead. I do not understand:

$ efetch -db nuccore -format fasta_cds_aa -id NZ_VMQU01000035.1 -chr_start 27743 -chr_stop 28750

>lcl|NZ_VMQU01000035.1_prot_WP_144950956.1_1 [locus_tag=FPZ47_RS10505] [protein=MmoB/DmpM family protein] [protein_id=WP_144950956.1] [location=complement(27456..27746)] [gbkey=CDS]
MTAAGVRKVGVDLQENSEDSRRIIEAIEADNPDCTVTHFPGLIKIQAPGQLVVRRETIEKLLSRPWETHE
FQMSIVTYYGNIQEWDDDEIVIAWDH
>lcl|NZ_VMQU01000035.1_prot_WP_144950958.1_2 [locus_tag=FPZ47_RS10510] [protein=phenol 2-monooxygenase] [protein_id=WP_144950958.1] [location=complement(27743..28750)] [gbkey=CDS]
MQYELRTQAIDPLRATFTPLVRRYGDKPATRYQEGTIDVQPVENFHYRPLWDSEREIYDERYSALRLTDP
YSFTDPRQYYYAPYVTARSQLFDAFAKTLDYVESRSLLANLPDAWRALAVSTLLPLRHYEGGAQLISSNG
ARFAYGTSIEQCLSFAAFDRIGNAQLLSRIGLALADGSAELLDQARGRWLGDEHLQPLRRFAEELLVEPD
WAVATVAWDVADQLIYQLVYQHLDKQALVSGASALSLLGQHLGDWFVDHRKWLNALIAAWLADAEHGPGN
AHVLRTTIDIWLPKATEAVSGLARAADEAADVGAVAAVERFVEQLRATTIEEKIA

Thank you!

efetch • 654 views
ADD COMMENT
2
Entering edit mode
2.2 years ago

I believe that your result indicates that the coding regions for the proteins overlap

we can check with bio like so

bio fetch NZ_VMQU01000035 | bio gff --id WP_144950956.1,WP_144950958.1

prints:

##gff-version 3
NZ_VMQU01000035.1       .       CDS     27456   27746   .       -       .       ID=1;Name=WP_144950956.1;Parent=WP_144950956.1
NZ_VMQU01000035.1       .       CDS     27743   28750   .       -       .       ID=2;Name=WP_144950958.1;Parent=WP_144950958.1

indeed the two CDS overlap

More on bio: https://www.bioinfo.help/

ADD COMMENT
0
Entering edit mode

I guess one possible solution is to change the fasta header into NZ_VMQU01000035.1 27743 28750 and then subset the multifasta file using coordinates

ADD REPLY

Login before adding your answer.

Traffic: 2138 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6