Hello, I would like to know how I can extract multiple FASTA sequences from a file, given a list of the IDs (133 in total) that I want to extract. I have started by loading my FASTA and ID files in R:
library("seqinr")
fastafile <- read.fasta(file = "proteins.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
head(fastafile)
$`1.1.1.m1`
"MRRRGQWWFTAETSVGQTANTSANSDLLSPAFWLVRGHEFKITRSDDPQHTALLQTSDDCLGGQTFRAKITSYGRFTERESWEIKPNVDGCRGSCNVSYAGRFEETVGFKQAKCSSRIQSEKNIGFWCAIGSRGSVMMIGGGGKPCTLGDHGIGITNAKDRSFSHSPSSKRNDFGDVATSSPETSYSLNLWIQ"
$`1.1.2.m1`
"MHEHTSQSVACGAQTEEVLRSITMRRKTNYQTATTCLVKLIFEHVLNVRKTNSIEKFDGLEARHRKHIKEIVALEINPNSFGISERQGPIPQPVILFPLNAEYQARDVKNRTAPGIPSGVSLAPGPNGEKDGSYEFFGNTNSFIEFPNSPRGALDVLYSITILCWVYYDEKGGPHGLIFEYNTGGKYGVHLWVVNRLFSARFIDRAFSYSRPYLRHTSLAGGWKFVGASYDNETGEIKLWADGA"

co2 = read.table("trt_co.csv", header=T, sep=",")
head(co2)
1 1.1.10073.m1
2 1.1.10395.m1
3 1.1.10428.m1
4 1.1.10509.m1
5 1.1.10621.m1
6 1.1.10760.m1
I would appreciate your help on what the next step would be.
You use read.table(header = T), but head(co2) does not show column names. Did you just omit them from this post?
I'm just curious, why would you do that in R?
One reason: because your downstream analysis is most easily performed in R. The seqinr package contains a lot of useful functions for statistical analysis of sequences; reading them in is just the first step.
Which program would you advise doing it in?
I would extract them before loading the file into R, for example using 'grep' (if your FASTA has each sequence on just one line) or a small Python/bash/Perl script.
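To make the grep idea concrete, here is a minimal sketch, not a definitive recipe. It assumes each sequence sits on a single line, each header is exactly ">ID" with no trailing description, and the ID list is plain text with one ID per line and no header row. The demo_* file names and the toy sequences are made up for illustration; swap in your own files.

```shell
# Toy stand-ins for the real inputs (hypothetical names and sequences).
printf '>1.1.1.m1\nMRRRGQ\n>1.1.10.m1\nMHEHTS\n>1.1.2.m1\nMKKLAA\n' > demo.fasta
printf '1.1.1.m1\n1.1.2.m1\n' > demo_ids.txt

# Turn each ID into the exact header line it should match.
sed 's/^/>/' demo_ids.txt > demo_headers.txt

# -F: patterns are fixed strings, not regular expressions
# -x: the whole line must match, so 1.1.1.m1 does not also pull in 1.1.10.m1
# -A 1: also print the line after each matching header (the sequence)
# The second grep drops the "--" separators that -A puts between groups.
grep -F -x -A 1 -f demo_headers.txt demo.fasta | grep -v '^--$' > demo_subset.fasta

cat demo_subset.fasta
```

If your sequences are wrapped over several lines, -A 1 will only keep the first line of each record, so this approach does not apply as written.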
If you are on a Linux machine, you can also go with the Kent source utilities: kent source. There is a lot of useful stuff in there; in your case, look at "faSomeRecords".