R: Parsing Fasta From Strings In R?
2
4
Entering edit mode
10.8 years ago
Chris Warth ▴ 110

My goal is to retrieve DNA gene sequence from NCBI using R, but I get stuck trying to parse FASTA sequences from strings.

I can get a FASTA string for a gene by using efetch from the Bioconductor genomes package.

handle = efetch("NM_009790", db="nucleotide", rettype="fasta")

This returns a string consisting of a fasta sequence,

  > handle
 [1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
 [2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
 [3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
 ...

Are there routines in bioconductor or elsewhere that can parse this fasta string? Clearly I can write some code to do this, but I thought surely there must be some code that does this already.
I've looked at readDNAStringSet from the BioStrings package, but that seems to only read from files, not character strings. I've also looked at readFASTA from the ShortRead package, but that too only reads from files. That's not surprising because it relies upon the BioStrings package to do that heavy lifting.

Thanks in advance.

fasta r ncbi • 11k views
ADD COMMENT
0
Entering edit mode

+1 for finding that efetch return types are annoying.

ADD REPLY
4
Entering edit mode
10.8 years ago
Michael 54k

Lamentation

What you are getting is a character vector, one entry per row, with the fasta header always in the first entry. Did I mention that this is an odd representation? I think, the first thing to do is to complain to the authors of efetch to integrate better with the Bioconductor infrastructure. Imho, the method efetch should already return a DNAStingSet object by default or at least have an option to convert, e.g. by

as(handle, DNAStringSet)

Their excuse might be historical reasons or dependencies, but I wouldn't buy it ;) The main reason is possibly, that efetch can return very different data types and not all are sequences.

Why is this annoying? Look at this:

efetch(c("NM_009790",'NM_009790'), db="nucleotide", rettype="fasta")

[1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
[12] ""                                                                      
[13] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[14] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[15] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
[16] "GCCTGCTCCGTTCTTCCTTCCTTCGCTCGCACCATGGCTGATCAGCTGACTGAAGAGCAGATTGCTGAAT"

Unfortunately, this is still a character vector, while it definitely should be a list type for multiple entries. So your only options are either to parse the vector yourself or to write it to a temporary file and read it with readRNAStringSet, because

  readDNAStringSet(stdin())
  Error in .normargInputFilepath(filepath) : 
  'filepath' must be a character vector with no NAs

Solution

In conclusion, the easiest and mostly safe way is possibly the following. Also, this doesn't create more overhead, because efetch will write its downloaded data to a temporary file anyway.

 tmp = tempfile()
 efetch(c("NM_009790",'NM_009790'), db="nucleotide", retmode="text", rettype="fasta", destfile=tmp)
 readDNAStringSet(tmp)
A DNAStringSet instance of length 2
 width seq                                                                         names               
[1]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
[2]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
ADD COMMENT
0
Entering edit mode

Thanks for the extensive reply. It looks like this has annoyed you as well. I thought I must be missing something simple but it sounds like it's just an annoying interface. It's too bad because SeqIO makes this very easy in python.

ADD REPLY
0
Entering edit mode

I think Neil's post might contain good options. SeqinR is not a Bioconductor package and rentrez maybe either, that might be the only 'disadvantage'. If you need anything else than a DNAStringSet object rentrez+seqinr might be better.

ADD REPLY
4
Entering edit mode
10.8 years ago
Neilfws 49k

For Eutils in R, I would look at the rentrez package. An efetch example:

COI <- entrez_fetch(db = "nucleotide", id = 167843256, file_format = "fasta")

This returns the fasta as a single character vector which can be written to a file if required.

Another useful package is seqinr. You could parse the fasta string, for example, using:

coi.fa <- read.fasta(file = textConnection(COI), as.string = T)
coi.fa

$`gi|167843256|gb|EU305448.1|`
[1] "atttgatttttggggcttgggcagctatagttggaacagcaataagagtattaattcgtactgagttaggacaacctggtagattattaggtgacgaccagttatataatgtaattgtaacagggcatgcttttgttataattttttttatagtaatgcctattttgattggagggtttgggaactgattagttcctttgatattaggagctcctgatatggctttccctcgaataaataatttaagattttgacttttaccttcatcattaattttattgtttatttcttctttagaggaagtaggggtaggggcaggatgaacaatttacccgcctttgtcaagactagagggtcatagaggtagatctgtggattttgctattttttccttacatttggcaggtgcttcgtctattataggggctattaattttatttctactattttaaacatacggctagtaggggtttcgatagaaaaggtaagattatttgtttggtcagtgttaattactgcggtattattattattgtcattacctgttttagctggtgctattactatattactaactgatcgtaattttaatacttcattctttgatcctgctggtggaggggatcctattttatttcaacatttgttttgattttttgggcaccctgaggtttatattttaattttgcctggatttgggattgtatctcatgttattagagcttctgtagggaagcgggagccttttggtagtttaggaataatttatgctatagtaggaattggagggataggttttgttgtgtgagcgcatcatatattttcagttggaatggatgtggatactcgagcatattttactgctgccactataattattgcagtgccaacaggtattaaggtctttagatggatagctactttgcatggttcttattttaaattagatacacctttattatggtgtgtaggttttgtatttttgtttactttaggaggaattacaggggtagtactttcaaattcttctttagatattgttttacacgatacttattatgttgtagctcattttcattatgtcttgagaataggggctgtttttgctattttggctggtgctacttattgattttctttattttttggattgaagatgagaatgaagaaaagaaagcttcagttttataccatatttttgggtgtgaatattacttttttcctcaacactttaggta"
attr(,"name")
[1] "gi|167843256|gb|EU305448.1|"
attr(,"Annot")
[1] ">gi|167843256|gb|EU305448.1| Latrodectus katipo voucher La13 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial"
attr(,"class")
[1] "SeqFastadna"
ADD COMMENT
0
Entering edit mode

Thank you for an equally good solution. The textConnection(COI) solution is pretty much what I expected the solution to look like.

ADD REPLY

Login before adding your answer.

Traffic: 1794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6