Question: R: Parsing Fasta From Strings In R?
1
gravatar for Chris Warth
6.1 years ago by
Chris Warth90
United States
Chris Warth90 wrote:

My goal is to retrieve DNA gene sequence from NCBI using R, but I get stuck trying to parse FASTA sequences from strings.

I can get a FASTA string for a gene by using efetch from the Bioconductor genomes package.

handle = efetch("NM_009790", db="nucleotide", rettype="fasta")

This returns a string consisting of a fasta sequence,

  > handle
 [1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
 [2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
 [3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
 ...

Are there routines in bioconductor or elsewhere that can parse this fasta string? Clearly I can write some code to do this, but I thought surely there must be some code that does this already.
I've looked at readDNAStringSet from the BioStrings package, but that seems to only read from files, not character strings. I've also looked at readFASTA from the ShortRead package, but that too only reads from files. That's not surprising because it relies upon the BioStrings package to do that heavy lifting.

Thanks in advance.

fasta R ncbi • 6.8k views
ADD COMMENTlink modified 6.1 years ago by Neilfws48k • written 6.1 years ago by Chris Warth90

+1 for finding that efetch return types are annoying.

ADD REPLYlink written 6.1 years ago by Michael Dondrup46k
4
gravatar for Michael Dondrup
6.1 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Lamentation

What you are getting is a character vector, one entry per row, with the fasta header always in the first entry. Did I mention that this is an odd representation? I think, the first thing to do is to complain to the authors of efetch to integrate better with the Bioconductor infrastructure. Imho, the method efetch should already return a DNAStingSet object by default or at least have an option to convert, e.g. by

as(handle, DNAStringSet)

Their excuse might be historical reasons or dependencies, but I wouldn't buy it ;) The main reason is possibly, that efetch can return very different data types and not all are sequences.

Why is this annoying? Look at this:

efetch(c("NM_009790",'NM_009790'), db="nucleotide", rettype="fasta")

[1] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[2] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[3] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
[12] ""                                                                      
[13] ">gi|118130270:1-700 Mus musculus calmodulin 1 (Calm1), mRNA"           
[14] "GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTCGGGCGGCGGCTGAGGCTCAGCGCACAACGCAGGT"
[15] "AGCGCGTTAGCAGCAGCAGAAGCGGAGGCACCTCGGCGGTCACAGCCCCTGCGCTGGTGCAGCCACCCTC"
[16] "GCCTGCTCCGTTCTTCCTTCCTTCGCTCGCACCATGGCTGATCAGCTGACTGAAGAGCAGATTGCTGAAT"

Unfortunately, this is still a character vector, while it definitely should be a list type for multiple entries. So your only options are either to parse the vector yourself or to write it to a temporary file and read it with readRNAStringSet, because

  readDNAStringSet(stdin())
  Error in .normargInputFilepath(filepath) : 
  'filepath' must be a character vector with no NAs

Solution

In conclusion, the easiest and mostly safe way is possibly the following. Also, this doesn't create more overhead, because efetch will write its downloaded data to a temporary file anyway.

 tmp = tempfile()
 efetch(c("NM_009790",'NM_009790'), db="nucleotide", retmode="text", rettype="fasta", destfile=tmp)
 readDNAStringSet(tmp)
A DNAStringSet instance of length 2
 width seq                                                                         names               
[1]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
[2]   700 GGGAGTCTCGTGTCCGTGGTGCCGTTACTCGAAGTC...ATTGAAATCTTTTACTTACCTCTTACAAAAAAAAGA gi|118130270:1-70...
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Michael Dondrup46k

Thanks for the extensive reply. It looks like this has annoyed you as well. I thought I must be missing something simple but it sounds like it's just an annoying interface. It's too bad because SeqIO makes this very easy in python.

ADD REPLYlink written 6.1 years ago by Chris Warth90

I think Neil's post might contain good options. SeqinR is not a Bioconductor package and rentrez maybe either, that might be the only 'disadvantage'. If you need anything else than a DNAStringSet object rentrez+seqinr might be better.

ADD REPLYlink written 6.1 years ago by Michael Dondrup46k
4
gravatar for Neilfws
6.1 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

For Eutils in R, I would look at the rentrez package. An efetch example:

COI <- entrez_fetch(db = "nucleotide", id = 167843256, file_format = "fasta")

This returns the fasta as a single character vector which can be written to a file if required.

Another useful package is seqinr. You could parse the fasta string, for example, using:

coi.fa <- read.fasta(file = textConnection(COI), as.string = T)
coi.fa

$`gi|167843256|gb|EU305448.1|`
[1] "atttgatttttggggcttgggcagctatagttggaacagcaataagagtattaattcgtactgagttaggacaacctggtagattattaggtgacgaccagttatataatgtaattgtaacagggcatgcttttgttataattttttttatagtaatgcctattttgattggagggtttgggaactgattagttcctttgatattaggagctcctgatatggctttccctcgaataaataatttaagattttgacttttaccttcatcattaattttattgtttatttcttctttagaggaagtaggggtaggggcaggatgaacaatttacccgcctttgtcaagactagagggtcatagaggtagatctgtggattttgctattttttccttacatttggcaggtgcttcgtctattataggggctattaattttatttctactattttaaacatacggctagtaggggtttcgatagaaaaggtaagattatttgtttggtcagtgttaattactgcggtattattattattgtcattacctgttttagctggtgctattactatattactaactgatcgtaattttaatacttcattctttgatcctgctggtggaggggatcctattttatttcaacatttgttttgattttttgggcaccctgaggtttatattttaattttgcctggatttgggattgtatctcatgttattagagcttctgtagggaagcgggagccttttggtagtttaggaataatttatgctatagtaggaattggagggataggttttgttgtgtgagcgcatcatatattttcagttggaatggatgtggatactcgagcatattttactgctgccactataattattgcagtgccaacaggtattaaggtctttagatggatagctactttgcatggttcttattttaaattagatacacctttattatggtgtgtaggttttgtatttttgtttactttaggaggaattacaggggtagtactttcaaattcttctttagatattgttttacacgatacttattatgttgtagctcattttcattatgtcttgagaataggggctgtttttgctattttggctggtgctacttattgattttctttattttttggattgaagatgagaatgaagaaaagaaagcttcagttttataccatatttttgggtgtgaatattacttttttcctcaacactttaggta"
attr(,"name")
[1] "gi|167843256|gb|EU305448.1|"
attr(,"Annot")
[1] ">gi|167843256|gb|EU305448.1| Latrodectus katipo voucher La13 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial"
attr(,"class")
[1] "SeqFastadna"
ADD COMMENTlink written 6.1 years ago by Neilfws48k

Thank you for an equally good solution. The textConnection(COI) solution is pretty much what I expected the solution to look like.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Chris Warth90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1036 users visited in the last hour