Question

Using ape package in R to pull genbank sequences

0

Entering edit mode

6.0 years ago

krc3004 ▴ 20

Hi all- I am trying use the ape package in R to pull protein sequences from GenBank. The sequence I'd like to retrieve is the following: https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

To do this I used the following code:

library(ape)
kl_seq = read.GenBank("AAG26087.1", as.character = TRUE)

And the sequence I obtained was:

$AAG26087.1
  [1] "m" "s" "r" "s" "s" "k" "r" "n" "r" "d" "g" "r" "g" "w" "v" "n" "g" "r" "k" "k" "d" "r" "d" "r" "k" "k" "k" "k" "k" "k" "d" "n" "w" "g" "n" "k" "g" "g" "k" "k" "d"
 [42] "k" "d" "g" "g" "a" "a" "k" "r" "a" "r" "t" "d" "m" "d" "s" "g" "g" "k" "r" "r" "g" "g" "s" "d" "k" "r" "d" "h" "r" "r" "r" "k" "a" "n" "k" "r" "k" "a" "a" "g" "g"
 [83] "k" "h" "s" "k" "k" "r" "t" "d" "r" "r" "r" "r" "t" "a" "g" "s" "v" "g" "g" "v" "n" "g" "g" "s" "r" "g" "a" "g" "g" "g" "v" "n" "m" "s" "v" "s" "s" "r" "t" "g" "g"
[124] "d" "v" "r" "g" "n" "g" "w" "d" "a" "d" "s" "s" "c" "r"

attr(,"species")
[1] "Hepatitis_delta_virus"

However, this sequence clearly does not match the sequence of length 214 aa in the above link. I'm wondering if there's something I'm missing here- perhaps there is another argument, function, or package I should use to pull the correct sequence? Is there any way to bulk download from GenBank directly, given a list of accession IDs? Any tips would be much appreciated. Thanks!

R genbank ncbi ape • 3.4k views

ADD COMMENT • link updated 6.0 years ago by Kevin Blighe 87k • written 6.0 years ago by krc3004 ▴ 20

1

Entering edit mode

Use NCBI's unix utilities.

efetch -db nuccore -id "AAG26087.1" -format fasta > AAG26087.fa

Replace ID's in a loop if you have multiple ID's to download

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

genomax, thanks very much for your help- I will try this. Is there any way to use efetch to get the nucleotide sequence as well?

ADD REPLY • link 6.0 years ago by krc3004 ▴ 20

0

Entering edit mode

Just use the right accession number and it will return nucleotide sequence.

ADD REPLY • link 6.0 years ago by GenoMax 141k

score 1 · Accepted Answer · 2018-05-09

That's weird. I looked at the function souce code and it executes only 2 possible functions:

URL <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
            paste(access.nb[a:b], collapse = ","), "&rettype=fasta&retmode=text")

URL <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
                paste(access.nb[a:b], collapse = ","), "&rettype=gb&retmode=text")

I'm not sure about the paste(access.nb[a:b], collapse = ",") part; however, if you just go to these URLs in your web browser, you then see the expected sequence. For example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text

This produces:

>AAG26087.1 large delta antigen [Hepatitis delta virus]
MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
CRPQ

The other command:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=gb&retmode=text

This just return the webpage (in plain text) at https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

-----------------------------------------

There are ways of automating this in Python but, if you want to stay within R, you can use the httr package. Here is a very messy example, but I'll leave the tidying up to you:

require(httr)
response <- GET("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text")
content(r, "text")

No encoding supplied: defaulting to UTF-8.
[1] ">AAG26087.1 large delta antigen [Hepatitis delta virus]\nMSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP\nAKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR\nERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS\nCRPQ\n\n"

You can either mess around with that messy format including the end-line characters, or you can 'cheat' your way and get this into a data-frame as follows:

write(content(r, "text"), "test.fasta")
read.csv("test.fasta")
                X.AAG26087.1.large.delta.antigen..Hepatitis.delta.virus.
1 MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
2 AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
3 ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
4                                                                   CRPQ

What you want to do will depend on your end gal. You could easily convert this into a function that accepts a GenBank accession ID and then loop it.

Kevin