Using ape package in R to pull genbank sequences
1
0
Entering edit mode
5.9 years ago
krc3004 ▴ 20

Hi all- I am trying use the ape package in R to pull protein sequences from GenBank. The sequence I'd like to retrieve is the following: https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

To do this I used the following code:

library(ape)
kl_seq = read.GenBank("AAG26087.1", as.character = TRUE)

And the sequence I obtained was:

$AAG26087.1
  [1] "m" "s" "r" "s" "s" "k" "r" "n" "r" "d" "g" "r" "g" "w" "v" "n" "g" "r" "k" "k" "d" "r" "d" "r" "k" "k" "k" "k" "k" "k" "d" "n" "w" "g" "n" "k" "g" "g" "k" "k" "d"
 [42] "k" "d" "g" "g" "a" "a" "k" "r" "a" "r" "t" "d" "m" "d" "s" "g" "g" "k" "r" "r" "g" "g" "s" "d" "k" "r" "d" "h" "r" "r" "r" "k" "a" "n" "k" "r" "k" "a" "a" "g" "g"
 [83] "k" "h" "s" "k" "k" "r" "t" "d" "r" "r" "r" "r" "t" "a" "g" "s" "v" "g" "g" "v" "n" "g" "g" "s" "r" "g" "a" "g" "g" "g" "v" "n" "m" "s" "v" "s" "s" "r" "t" "g" "g"
[124] "d" "v" "r" "g" "n" "g" "w" "d" "a" "d" "s" "s" "c" "r"

attr(,"species")
[1] "Hepatitis_delta_virus"

However, this sequence clearly does not match the sequence of length 214 aa in the above link. I'm wondering if there's something I'm missing here- perhaps there is another argument, function, or package I should use to pull the correct sequence? Is there any way to bulk download from GenBank directly, given a list of accession IDs? Any tips would be much appreciated. Thanks!

R genbank ncbi ape • 3.4k views
ADD COMMENT
1
Entering edit mode

Use NCBI's unix utilities.

efetch -db nuccore -id "AAG26087.1" -format fasta > AAG26087.fa

Replace ID's in a loop if you have multiple ID's to download

ADD REPLY
0
Entering edit mode

genomax, thanks very much for your help- I will try this. Is there any way to use efetch to get the nucleotide sequence as well?

ADD REPLY
0
Entering edit mode

Just use the right accession number and it will return nucleotide sequence.

ADD REPLY
1
Entering edit mode
5.9 years ago

That's weird. I looked at the function souce code and it executes only 2 possible functions:

URL <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
            paste(access.nb[a:b], collapse = ","), "&rettype=fasta&retmode=text")

URL <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
                paste(access.nb[a:b], collapse = ","), "&rettype=gb&retmode=text")

I'm not sure about the paste(access.nb[a:b], collapse = ",") part; however, if you just go to these URLs in your web browser, you then see the expected sequence. For example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text

This produces:

>AAG26087.1 large delta antigen [Hepatitis delta virus]
MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
CRPQ

The other command:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=gb&retmode=text

This just return the webpage (in plain text) at https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

-----------------------------------------

There are ways of automating this in Python but, if you want to stay within R, you can use the httr package. Here is a very messy example, but I'll leave the tidying up to you:

require(httr)
response <- GET("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text")
content(r, "text")

No encoding supplied: defaulting to UTF-8.
[1] ">AAG26087.1 large delta antigen [Hepatitis delta virus]\nMSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP\nAKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR\nERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS\nCRPQ\n\n"

You can either mess around with that messy format including the end-line characters, or you can 'cheat' your way and get this into a data-frame as follows:

write(content(r, "text"), "test.fasta")
read.csv("test.fasta")
                X.AAG26087.1.large.delta.antigen..Hepatitis.delta.virus.
1 MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
2 AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
3 ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
4                                                                   CRPQ

What you want to do will depend on your end gal. You could easily convert this into a function that accepts a GenBank accession ID and then loop it.

Kevin

ADD COMMENT
0
Entering edit mode

Thank you Kevin! This solution works on my end but it is indeed messy, as you suggest...not sure why the ape package is returning the incorrect sequence. A further question is how to obtain the nucleotide sequence for these accessions but perhaps I can work that out separately...thanks again!

ADD REPLY
0
Entering edit mode

Yes, as to what is happening in the ape function should probably be directed to the developers. It's an otherwise extremely good package. Best of luck.

ADD REPLY

Login before adding your answer.

Traffic: 2641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6