How To Extract Multiple Fasta Sequences At A Time From A File Containing Sequences Ids Using R
1
0
Entering edit mode
8.8 years ago
catagui ▴ 40

Hello I will like to know how can I extract multiple fasta sequences from a file that have a list of the IDs (133 in total) I want to extract. I have started by loading my fasta and ID file in R:

library("seqinr")
fastafile<- read.fasta(file = "proteins.fasta", 
                       seqtype = "AA",as.string = TRUE, set.attributes = FALSE)

head(fastafile)

$`1.1.1.m1`
[1] "MRRRGQWWFTAETSVGQTANTSANSDLLSPAFWLVRGHEFKITRSDDPQHTALLQTSDDCLGGQTFRAKITSYGRFTERESWEIKPNVDGCRGSCNVSYAGRFEETVGFKQAKCSSRIQSEKNIGFWCAIGSRGSVMMIGGGGKPCTLGDHGIGITNAKDRSFSHSPSSKRNDFGDVATSSPETSYSLNLWIQ"

$`1.1.2.m1`
[1] "MHEHTSQSVACGAQTEEVLRSITMRRKTNYQTATTCLVKLIFEHVLNVRKTNSIEKFDGLEARHRKHIKEIVALEINPNSFGISERQGPIPQPVILFPLNAEYQARDVKNRTAPGIPSGVSLAPGPNGEKDGSYEFFGNTNSFIEFPNSPRGALDVLYSITILCWVYYDEKGGPHGLIFEYNTGGKYGVHLWVVNRLFSARFIDRAFSYSRPYLRHTSLAGGWKFVGASYDNETGEIKLWADGA"

co2=read.table("trt_co.csv",header=T, sep=",")
head(co2)

1 1.1.10073.m1
2 1.1.10395.m1
3 1.1.10428.m1
4 1.1.10509.m1
5 1.1.10621.m1
6 1.1.10760.m1

I will appreciate your help on what would be the next step.

Thanks

r fasta • 9.6k views
ADD COMMENT
0
Entering edit mode

You use read.table(header = T), but head(co2) does not show column names? Or did you just omit them from this post?

ADD REPLY
0
Entering edit mode

I'm just curious, why would you do that in R?

ADD REPLY
0
Entering edit mode

One reason: because your downstream analysis is most easily performed in R. The seqinr package contains a lot of useful functions for statistical analysis of sequences; reading them in is just the first step.

ADD REPLY
0
Entering edit mode

In which program would you advice doing it?

ADD REPLY
0
Entering edit mode

I would extract them before I load it into R, for example using 'grep' (if your fasta hast sequences in just one line) some script pyhton/bash/perl....

if you are on a linux machine you can also go with kent source utils: kent source. There is a lof of usefull stuff in your case look at "faSomeRecords".

ADD REPLY
2
Entering edit mode
8.8 years ago
Neilfws 49k

You don't say what the column in co2 containing the IDs is called, but let's assume that it is named ids.

Then you can simply subset the list fastafile like this:

fastafile[names(fastafile) %in% co2$ids]
ADD COMMENT
0
Entering edit mode

Sorry, the column didn't had a name when I copy from text file.

I have tried it and it works great.

Thanks

ADD REPLY
0
Entering edit mode

Hey,

Sorry I am new to R. After entering the line of code:

fastafile[names(fastafile) %in% list$ids

R returns the following:

named list()

Can anyone explain this to me?

ADD REPLY
0
Entering edit mode

This doesn't work because the fasta format is a list of the id, ali and call for that set of data. When you try to subset this way, it is trying to find the list of sequences you want in the list that is currently [id, ali, call], so it gives you an empty result (since none of your sequence names match those objects).

I did this instead, creating a new fasta object with the subsetted alignment object (the second object in the fasta list) that I took out of my original fastafile.

newfastafile=fastafile(fastafile$ali[fastafile$id %in% list$ids,])

I was encountering the same problem.

ADD REPLY
0
Entering edit mode

OK so I did this and I get an error that says fastafile is not a function so is fastafile that is outside of the () just supposed to be calling the object created earlier?

ADD REPLY

Login before adding your answer.

Traffic: 2000 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6