I'm looking for a way to extract the nucleotide sequences from NCBI GenBank records that correspond to specific annotated regions in the associated NCBI GenPept records, either manually or, ideally, programmatically with the R package rentrez, with the output in FASTA format.
For example, this spike protein sequence has two annotated regions, corresponding to the S1 and S2 glycoproteins, that can easily be highlighted or isolated. But the corresponding GenBank nucleotide entry doesn't carry that region annotation; it gives only the nucleotide sequence that encodes the whole protein. Is there a way of cross-referencing the two records to isolate only the relevant sequence?
I don't think you can retrieve the nucleotide sequence for just those regions. They are annotated as regions on the protein record and, AFAIK, you can only retrieve the nucleotide sequence of the entire CDS.
$ esearch -db protein -query "QBP43268" | efetch -format ft
>Feature gb|QBP43268.1|
1 1352 Protein
      product S protein
234 721 Region
      region Corona_S1
      note Coronavirus S1 glycoprotein
      db_xref CDD:279880
729 1351 Region
      region Corona_S2
      note Coronavirus S2 glycoprotein
      db_xref CDD:279881
1 1352 CDS
      product S protein
      protein_id gb|QBP43268.1|
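The Region coordinates in this feature table are amino-acid positions, so they can at least be converted to nucleotide coordinates relative to the start of the CDS. A minimal sketch of that arithmetic follows. It assumes the CDS is contiguous, that codon 1 starts at nucleotide 1 of the CDS, and that positions are 1-based and inclusive. The function name `regions_to_nt` is hypothetical, not part of Entrez Direct.

```shell
# Convert amino-acid Region coordinates from an NCBI feature table (stdin)
# into nucleotide coordinates relative to the first base of the CDS.
# Assumption: contiguous CDS, reading frame starts at nucleotide 1.
regions_to_nt() {
  awk '
    # Feature line: <aa_start> <aa_end> Region (partial coordinates are skipped)
    $3 == "Region" && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
      aa_start = $1; aa_end = $2
      next
    }
    # The "region" qualifier that follows carries the region name
    $1 == "region" && aa_start != "" {
      nt_start = (aa_start - 1) * 3 + 1   # first base of the first codon
      nt_end   = aa_end * 3               # last base of the last codon
      printf "%s\t%d\t%d\n", $2, nt_start, nt_end
      aa_start = ""
    }
  '
}

# Possible usage with the command shown above (needs network access):
#   esearch -db protein -query "QBP43268" | efetch -format ft | regions_to_nt
```

For the table above this gives 700 to 2163 for Corona_S1 (amino acids 234 to 721) and 2185 to 4053 for Corona_S2 (amino acids 729 to 1351). These are offsets within the CDS, not positions on the genome record.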
This is already a two-year-old question, but I was able to solve the same issue for bacterial sequences using Entrez Direct (the UNIX command line E-utilities).
esearch -db protein -organism bacteria -query "My query" |
efetch -format fasta_cds_na > output.fasta
This worked well enough for me, though I was downloading many sequences and ran into a few "EMPTY RESULT QUERY FAILURE" errors that I don't know how to resolve (maybe these are GenPept entries with no corresponding Nucleotide links?).
Update: for your exact question, the solution would be:
esearch -db protein -query "QBP43268" |
efetch -format fasta_cds_na
Thanks for posting the solution. While this is a generic solution, it is not applicable to the original question that was posted: it returns the nucleotide sequence of the entire CDS, not just the annotated S1 and S2 regions. Just wanted to make a note of that.
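One possible way to bridge that gap is to cut the region out of the full CDS FASTA after downloading it. The sketch below is only that, a sketch: `slice_fasta` is a hypothetical helper, and it assumes the region's amino-acid range (1-based, inclusive) maps onto the CDS as nt_start = (aa_start - 1) * 3 + 1 and nt_end = aa_end * 3, which holds only for a contiguous CDS whose first codon starts at nucleotide 1.

```shell
# Print nucleotides nt_start..nt_end (1-based, inclusive) of the FIRST
# record in a FASTA stream read from stdin, under a new header.
#   $1 = nt_start, $2 = nt_end, $3 = label used in the new header
slice_fasta() {
  awk -v s="$1" -v e="$2" -v label="$3" '
    /^>/   { n++; next }          # count records, skip header lines
    n == 1 { seq = seq $0 }       # join the wrapped lines of record 1
    END {
      print ">" label "_nt" s "-" e
      print substr(seq, s, e - s + 1)
    }
  '
}

# Possible usage for the S1 region (amino acids 234..721 -> nt 700..2163);
# needs network access:
#   esearch -db protein -query "QBP43268" |
#   efetch -format fasta_cds_na |
#   slice_fasta 700 2163 Corona_S1
```

The output sequence is written on a single line. It is worth checking that the slice length is a multiple of three and that its translation matches the region in the GenPept record before relying on it.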