Question: Using ape package in R to pull genbank sequences
0
gravatar for krc3004
20 months ago by
krc300410
krc300410 wrote:

Hi all- I am trying use the ape package in R to pull protein sequences from GenBank. The sequence I'd like to retrieve is the following: https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

To do this I used the following code:

library(ape)
kl_seq = read.GenBank("AAG26087.1", as.character = TRUE)

And the sequence I obtained was:

$AAG26087.1
  [1] "m" "s" "r" "s" "s" "k" "r" "n" "r" "d" "g" "r" "g" "w" "v" "n" "g" "r" "k" "k" "d" "r" "d" "r" "k" "k" "k" "k" "k" "k" "d" "n" "w" "g" "n" "k" "g" "g" "k" "k" "d"
 [42] "k" "d" "g" "g" "a" "a" "k" "r" "a" "r" "t" "d" "m" "d" "s" "g" "g" "k" "r" "r" "g" "g" "s" "d" "k" "r" "d" "h" "r" "r" "r" "k" "a" "n" "k" "r" "k" "a" "a" "g" "g"
 [83] "k" "h" "s" "k" "k" "r" "t" "d" "r" "r" "r" "r" "t" "a" "g" "s" "v" "g" "g" "v" "n" "g" "g" "s" "r" "g" "a" "g" "g" "g" "v" "n" "m" "s" "v" "s" "s" "r" "t" "g" "g"
[124] "d" "v" "r" "g" "n" "g" "w" "d" "a" "d" "s" "s" "c" "r"

attr(,"species")
[1] "Hepatitis_delta_virus"

However, this sequence clearly does not match the sequence of length 214 aa in the above link. I'm wondering if there's something I'm missing here- perhaps there is another argument, function, or package I should use to pull the correct sequence? Is there any way to bulk download from GenBank directly, given a list of accession IDs? Any tips would be much appreciated. Thanks!

ape genbank R ncbi • 1.4k views
ADD COMMENTlink modified 20 months ago by Kevin Blighe53k • written 20 months ago by krc300410
1

Use NCBI's unix utilities.

efetch -db nuccore -id "AAG26087.1" -format fasta > AAG26087.fa

Replace ID's in a loop if you have multiple ID's to download

ADD REPLYlink written 20 months ago by genomax76k

genomax, thanks very much for your help- I will try this. Is there any way to use efetch to get the nucleotide sequence as well?

ADD REPLYlink written 20 months ago by krc300410

Just use the right accession number and it will return nucleotide sequence.

ADD REPLYlink written 20 months ago by genomax76k
1
gravatar for Kevin Blighe
20 months ago by
Kevin Blighe53k
Kevin Blighe53k wrote:

That's weird. I looked at the function souce code and it executes only 2 possible functions:

URL <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
            paste(access.nb[a:b], collapse = ","), "&rettype=fasta&retmode=text")

URL <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
                paste(access.nb[a:b], collapse = ","), "&rettype=gb&retmode=text")

I'm not sure about the paste(access.nb[a:b], collapse = ",") part; however, if you just go to these URLs in your web browser, you then see the expected sequence. For example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text

This produces:

>AAG26087.1 large delta antigen [Hepatitis delta virus]
MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
CRPQ

The other command:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=gb&retmode=text

This just return the webpage (in plain text) at https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

-----------------------------------------

There are ways of automating this in Python but, if you want to stay within R, you can use the httr package. Here is a very messy example, but I'll leave the tidying up to you:

require(httr)
response <- GET("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text")
content(r, "text")

No encoding supplied: defaulting to UTF-8.
[1] ">AAG26087.1 large delta antigen [Hepatitis delta virus]\nMSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP\nAKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR\nERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS\nCRPQ\n\n"

You can either mess around with that messy format including the end-line characters, or you can 'cheat' your way and get this into a data-frame as follows:

write(content(r, "text"), "test.fasta")
read.csv("test.fasta")
                X.AAG26087.1.large.delta.antigen..Hepatitis.delta.virus.
1 MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
2 AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
3 ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
4                                                                   CRPQ

What you want to do will depend on your end gal. You could easily convert this into a function that accepts a GenBank accession ID and then loop it.

Kevin

ADD COMMENTlink modified 20 months ago • written 20 months ago by Kevin Blighe53k

Thank you Kevin! This solution works on my end but it is indeed messy, as you suggest...not sure why the ape package is returning the incorrect sequence. A further question is how to obtain the nucleotide sequence for these accessions but perhaps I can work that out separately...thanks again!

ADD REPLYlink written 20 months ago by krc300410

Yes, as to what is happening in the ape function should probably be directed to the developers. It's an otherwise extremely good package. Best of luck.

ADD REPLYlink written 20 months ago by Kevin Blighe53k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1572 users visited in the last hour