Question

R: Parsing Fasta From Strings In R?

10.8 years ago

Chris Warth ▴ 110

My goal is to retrieve a gene's DNA sequence from NCBI using R, but I am stuck trying to parse FASTA sequences from strings.

I can get a FASTA string for a gene by using efetch from the Bioconductor genomes package.

handle = efetch("NM_009790", db="nucleotide", rettype="fasta")

This returns a character vector holding the FASTA record, one line per element:

  > handle
 [1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
 [2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
 [3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
 ...

Are there routines in Bioconductor or elsewhere that can parse this FASTA string? Clearly I can write some code to do this, but I thought surely there must be some code that does this already.
I've looked at readDNAStringSet from the Biostrings package, but that seems to read only from files, not from character strings. I've also looked at readFasta from the ShortRead package, but that too reads only from files. That's not surprising, because it relies on the Biostrings package to do the heavy lifting.

Thanks in advance.

+1 for finding that efetch return types are annoying.

(comment by Michael, 10.8 years ago)

score 4 · Answer 1 · 2013-07-02

Lamentation

What you are getting is a character vector with one entry per line of the FASTA record, and the FASTA header always in the first entry. Did I mention that this is an odd representation? I think the first thing to do is to ask the authors of efetch to integrate better with the Bioconductor infrastructure. IMHO, efetch should return a DNAStringSet object by default, or at least offer an option to convert, e.g. by

as(handle, "DNAStringSet")

Their excuse might be historical reasons or dependencies, but I wouldn't buy it ;) The main reason is probably that efetch can return very different data types, and not all of them are sequences.

Why is this annoying? Look at this:

efetch(c("NM_009790",'NM_009790'), db="nucleotide", rettype="fasta")

[1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
 ...
[12] ""                                                                      
[13] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[14] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[15] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
[16] "GCCTGCTCCGTTCTTCCTTCCTTCGCTCGCACCATGGCTGATCAGCTGACTGAAGAGCAGATTGCTGAAT"

Unfortunately, this is still a flat character vector, when it really should be a list with one element per entry. So your only options are either to parse the vector yourself or to write it to a temporary file and read it with readDNAStringSet, because readDNAStringSet accepts only file paths, not connections:

  readDNAStringSet(stdin())
  Error in .normargInputFilepath(filepath) : 
  'filepath' must be a character vector with no NAs
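
For the "parse the vector yourself" option, here is a minimal base R sketch. The helper name parse_fasta_lines and the toy records are made up for illustration; they are not part of genomes or Biostrings. It assumes the input looks like the efetch output above: header lines starting with ">", sequence lines after each header, and optional blank lines between records.

```r
# Hypothetical helper (not from any package): collapse a character vector of
# FASTA lines into a named character vector with one sequence per record.
parse_fasta_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop blank separator lines
  is_header <- startsWith(lines, ">")
  if (length(lines) == 0 || !is_header[1]) {
    stop("input does not start with a FASTA header line")
  }
  n_records <- sum(is_header)
  record <- cumsum(is_header)                # record index for every line
  # Group the sequence lines by record; keep all levels so that a header
  # without sequence lines still yields an (empty) entry.
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(n_records)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Toy input (made-up data) shaped like the efetch result above.
fasta_lines <- c(">seq1 first toy record", "ACGT", "TTGA", "",
                 ">seq2 second toy record", "GGCC")
seqs <- parse_fasta_lines(fasta_lines)
print(seqs)
```

If Biostrings is installed, passing the result to Biostrings::DNAStringSet(seqs) should give a DNAStringSet that keeps the header text as names.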

Solution

In conclusion, the easiest and reasonably safe way is probably the following. It also adds no extra overhead, because efetch writes its downloaded data to a temporary file anyway.

 tmp = tempfile()
 efetch(c("NM_009790",'NM_009790'), db="nucleotide", retmode="text", rettype="fasta", destfile=tmp)
 readDNAStringSet(tmp)
A DNAStringSet instance of length 2
 width seq                                                                         names               
[1]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
[2]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
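
If you already hold the FASTA lines in a character vector (for example, from an earlier efetch call without destfile), the same temporary-file round trip can be done by hand. This is only a sketch: the handle vector below is made-up toy data, and the Biostrings step is guarded so that the rest still runs if the package is not installed.

```r
# Toy stand-in for the character vector returned by efetch (made-up data).
handle <- c(">toy1 made-up record", "ACGTACGT", "GGCC", "",
            ">toy2 made-up record", "TTTTAAAA")

tmp <- tempfile(fileext = ".fasta")
writeLines(handle[nzchar(handle)], tmp)      # one element per line, blanks dropped
written <- readLines(tmp)                    # what actually landed in the file

# Biostrings is needed only for this step; skip it if the package is missing.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  dna <- Biostrings::readDNAStringSet(tmp)
  print(dna)
}

unlink(tmp)                                  # clean up the temporary file
```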

score 4 · Answer 2 · 2013-07-02

For Eutils in R, I would look at the rentrez package. An efetch example:

COI <- entrez_fetch(db = "nucleotide", id = 167843256, file_format = "fasta")

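This returns the FASTA record as a single character string (a character vector of length one, with embedded newlines), which can be written to a file if required. Note that newer rentrez releases name the format argument rettype rather than file_format.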

Another useful package is seqinr. You could parse the fasta string, for example, using:

coi.fa <- read.fasta(file = textConnection(COI), as.string = TRUE)
coi.fa

$`gi|167843256|gb|EU305448.1|`
[1] "atttgatttttggggcttgggcagctatagttggaacagcaataagagtattaattcgtactgagttaggacaacctggtagattattaggtgacgaccagttatataatgtaattgtaacagggcatgcttttgttataattttttttatagtaatgcctattttgattggagggtttgggaactgattagttcctttgatattaggagctcctgatatggctttccctcgaataaataatttaagattttgacttttaccttcatcattaattttattgtttatttcttctttagaggaagtaggggtaggggcaggatgaacaatttacccgcctttgtcaagactagagggtcatagaggtagatctgtggattttgctattttttccttacatttggcaggtgcttcgtctattataggggctattaattttatttctactattttaaacatacggctagtaggggtttcgatagaaaaggtaagattatttgtttggtcagtgttaattactgcggtattattattattgtcattacctgttttagctggtgctattactatattactaactgatcgtaattttaatacttcattctttgatcctgctggtggaggggatcctattttatttcaacatttgttttgattttttgggcaccctgaggtttatattttaattttgcctggatttgggattgtatctcatgttattagagcttctgtagggaagcgggagccttttggtagtttaggaataatttatgctatagtaggaattggagggataggttttgttgtgtgagcgcatcatatattttcagttggaatggatgtggatactcgagcatattttactgctgccactataattattgcagtgccaacaggtattaaggtctttagatggatagctactttgcatggttcttattttaaattagatacacctttattatggtgtgtaggttttgtatttttgtttactttaggaggaattacaggggtagtactttcaaattcttctttagatattgttttacacgatacttattatgttgtagctcattttcattatgtcttgagaataggggctgtttttgctattttggctggtgctacttattgattttctttattttttggattgaagatgagaatgaagaaaagaaagcttcagttttataccatatttttgggtgtgaatattacttttttcctcaacactttaggta"
attr(,"name")
[1] "gi|167843256|gb|EU305448.1|"
attr(,"Annot")
[1] ">gi|167843256|gb|EU305448.1| Latrodectus katipo voucher La13 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial"
attr(,"class")
[1] "SeqFastadna"
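
The textConnection trick works because it lets R read a string as if it were a file. A small base R sketch of the idea, using made-up FASTA text in place of the string returned by entrez_fetch:

```r
# Made-up FASTA text standing in for the single string from entrez_fetch.
fasta_text <- ">toy1 made-up record\nACGTAC\nGTTT"

# Option 1: wrap the string in a text connection and read it like a file.
con <- textConnection(fasta_text)
lines_from_con <- readLines(con)
close(con)

# Option 2: split the string on newlines directly.
lines_from_split <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]

header   <- sub("^>", "", lines_from_con[1])
sequence <- paste(lines_from_con[-1], collapse = "")
print(header)
print(sequence)
```

Either way you end up with one vector element per line, which is the form that file-oriented readers such as read.fasta consume.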