How to subset EST sequences in a FASTA file by sequence ID or name using R?
10.0 years ago
second_exon ▴ 210

I have thousands of EST sequences in a FASTA file. How can I extract a subset of them based on sequence ID or name using R?

Sequence example:

>gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence
GCAGCCGTCGGATCGTGAGCGTGATCGCGTGGCTAGTCGGGTTGGCGAAATGGTTGGATGATATCCGGAG
GTGGAGGAACCCCATTACCACGGTATTGGTCCACATCTTATATTTAGTGCTTGTTTGGTACCCGGATTTG
ATTGTCCCAACCGGGTTTTTATATGTGTTCCTAATCGGTGTATGGTACTATCGGTTTCGGCCCAAGATAC
CAGCGGGTATGGATACCCGACTCTCACAAGCTGAAGCGGTTGACCCGGATGAGCTTGATGAGGAATTCGA
CACCATACCGAGCTCAAAACCACCCGACATAATCAGGGTCCGGTATGACCGGTTGCGGATATTGGCAGCC
CGGGTTCAAACGGTTTTGGGTGATTTTGCAACACAAGGGGAGCGGGTTCAGGCCTTGGTTAGCTGGAGGG
ACCCAAGGGCCACAAAATTGTTCATAGGCGTGTGCTTGGCCATAACAATAATTCTCTATGTGGTGCCACC
CAAAATGGTTGCCGTGGCACTTGGATTCTACTATTTACGACACCCCATGTTCCGAGACCCCATGCCTCCT
GCAAGCTTGAATTTCTTCAGAAGGCTTCCAAGCCTTTCAGACCGCTTTAATGTAGATTAGAATATTATAT
GATTATTAGTAGGCCCAA

>gi|296783887|gb|GW992814.1|GW992814 UAS-Mi9 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Dehydration-responsive protein RD22, Similar to BURP domain-containing protein like gene from Morus indica, mRNA sequence
AAGCAGTGGTCTAGAACCAGAGTGGCCCCTGCGATGCAGGTATCATCTCTATTATCAAAAGGGATAAGGG
GTGGATCCGTCGGGGATTTGAGTCTCACATGGTCGCTGATAACTTATTGAATGGATATTGGATTGTGTGC
AGTGCGACCTAAACAGGATTGCCGTTGGGGCCTGTGGTCAGAGATACCCCACACTTCTCAACTCCCAAAT
TGGATCTTGTTCCTTGTTTTCCTGTATTAAGCCTGACCCCTGAGGCTTTCGCCACTGCCAACTGGGTGCC
GCCTGCTGACTTCTGATTCCCCGTGCTAACGGTTACTCCCGATTCCTTATCCACATCGAAGATGAACTAT
TGACTTCCGCAAACTCAAAAGGCTGCAAGATATCACTGACCGCTGTCGGGATCCGCGATCGGCATATACG
CGAAATCCGATCCCGGATCCCGGGACTGCAGACGGCTGAA

For example, can I select a record using its full header:

>gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence

or using only the gi number (here 296783888)?

How can I do this in R?

R EST subset • 3.3k views
10.0 years ago
# Load Biostrings (Bioconductor) and read the FASTA file into a DNAStringSet
library(Biostrings)
f <- readDNAStringSet("sequences.fa")

You can then use names(f) to get the sequence headers and either match them exactly, grep them for the gi number, or split them on "|" and match the gi field directly. The f object can be subset directly, either by index (e.g., f[c(1,3,5,6)]) or with a logical vector built from names(f).
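
As a rough sketch of those three approaches: the snippet below writes a toy FASTA file containing shortened versions of the two records from the question (headers and sequences truncated for illustration), reads it back, and subsets it by exact header, by grep, and by gi number. The variable names (wanted_names, wanted_gi, by_name, by_grep, by_gi) and the temporary file paths are placeholders, and it assumes the gi number is always the second "|"-separated field of the header, as in the example records.

```r
library(Biostrings)

# Toy input: shortened copies of the two records from the question.
fa <- tempfile(fileext = ".fa")
writeLines(c(
  ">gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry",
  "GCAGCCGTCGGATCGTGAGCGTGATCGCGTGGCTAGTCGG",
  ">gi|296783887|gb|GW992814.1|GW992814 UAS-Mi9 Complementary DNA of mulberry",
  "AAGCAGTGGTCTAGAACCAGAGTGGCCCCTGCGATGCAGG"
), fa)

f <- readDNAStringSet(fa)

# 1. Exact match on the full header (everything after ">").
wanted_names <- "gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry"
by_name <- f[names(f) %in% wanted_names]

# 2. Substring match; fixed = TRUE so "|" is treated literally, not as regex "or".
by_grep <- f[grepl("gi|296783887|", names(f), fixed = TRUE)]

# 3. Split each header on "|" and compare the second field (the gi number).
gi <- vapply(strsplit(names(f), "|", fixed = TRUE), `[`, character(1), 2)
wanted_gi <- c("296783888")
by_gi <- f[gi %in% wanted_gi]

# Write the selected records to a new FASTA file.
out <- tempfile(fileext = ".fa")
writeXStringSet(by_gi, out)
```

With thousands of sequences, the third variant is probably the most convenient, since wanted_gi can be a long vector of gi numbers read from a separate file.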
