11 months ago
alopex

I'm looking for a way to extract the nucleotide sequences from NCBI GenBank records that correspond to specific annotated regions in the associated NCBI GenPept records, either manually or, ideally, programmatically with the R package rentrez, with the output in FASTA format.

For example, this spike protein sequence has two annotated regions, corresponding to the S1 and S2 glycoproteins, that can easily be highlighted or isolated. But the corresponding nucleotide GenBank entry doesn't carry that region annotation; it gives only the nucleotide sequence coding for the whole protein. Is there a way of cross-referencing the two records to isolate only the relevant sequence?

I don't think you can retrieve the nucleotide sequence for just those regions. They are annotated as Region features on the protein record, and AFAIK you can only retrieve the nucleotide sequence of the entire CDS.

$ esearch -db protein -query "QBP43268" | efetch -format ft
>Feature gb|QBP43268.1|
1   1352    Protein
            product S protein
234 721     Region
            region  Corona_S1
            note    Coronavirus S1 glycoprotein
            db_xref CDD:279880
729 1351    Region
            region  Corona_S2
            note    Coronavirus S2 glycoprotein
            db_xref CDD:279881
1   1352    CDS
            product S protein
            protein_id  gb|QBP43268.1|
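
That said, once you have the whole CDS nucleotide sequence, the amino-acid coordinates in the feature table can be converted to nucleotide coordinates with simple codon arithmetic. Below is a minimal sketch, not a tested pipeline. The function names aa_to_nt and slice_nt are made up for illustration, and it assumes the CDS is contiguous (no joins or frameshifts), starts at codon position 1, and that the protein coordinates are 1-based and inclusive, as in the table above.

```shell
# Convert a 1-based, inclusive amino-acid range to the matching
# 1-based, inclusive nucleotide range within the CDS.
# Assumption: contiguous CDS that starts at codon position 1.
aa_to_nt() {
    # $1 = first amino acid, $2 = last amino acid
    echo "$(( ($1 - 1) * 3 + 1 )) $(( $2 * 3 ))"
}

# Cut a nucleotide range out of a CDS given as a single-line string.
# $1 = sequence, $2 = first nucleotide, $3 = last nucleotide
slice_nt() {
    printf '%s\n' "$1" | cut -c "$2-$3"
}

aa_to_nt 234 721     # Corona_S1 -> 700 2163
aa_to_nt 729 1351    # Corona_S2 -> 2185 4053
```

So, under those assumptions, Corona_S1 would be nucleotides 700 to 2163 of the CDS and Corona_S2 would be 2185 to 4053. Note that these are positions within the CDS itself, not within the full genome record; to get genome coordinates you would still have to add the offset of the CDS start.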

