Question: Extract A Group Of Fasta Sequences From A File
6
gravatar for Kyle
7.3 years ago by
Kyle60
Kyle60 wrote:

I have a group of fasta sequences numbered sequentially 1,2, etc. and would like to extract a subset by their numbers/identifiers. Using bp_index and bp_fetch in bioperl enables me to extract a single sequences, but I am not familiar enough with perl to automate this for a large group. Any suggestions, ideally in python but also in perl? Thanks.

ADD COMMENTlink modified 18 months ago by cpcantalapiedra140 • written 7.3 years ago by Kyle60
14
gravatar for Eric Normandeau
7.3 years ago by
Eric Normandeau9.6k
Quebec, Canada
Eric Normandeau9.6k wrote:

Hi,

Here is a quick solution in Python.

from Bio import SeqIO

fasta_file = "fasta_file.fasta" # Input fasta file
wanted_file = "wanted_file.txt" # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

Cheers!

ADD COMMENTlink modified 6.2 years ago • written 7.3 years ago by Eric Normandeau9.6k
2

i think that's a poor representation of how easy it is to do this in biopython (sorry eric, but ...). how about something like this:

http://gist.github.com/477969

looks less like java, uses less code, and is more flexible.

ADD REPLYlink written 7.3 years ago by brentp22k
1

@eric, what does (number_file) stands for, in your code? Just because I don't understand it. Is it the wanted_file? Thank u!

ADD REPLYlink written 6.2 years ago by Peixe510
1

Hi @peixe, you had guessed right. I changed it to 'wanted_file'. Cheers

ADD REPLYlink written 6.2 years ago by Eric Normandeau9.6k
1

Just found this response (after posting on 2827),but the results file does not actually get anything written to it. One main issue I can think of is I am just trying to compare "seq.id" and not the full "seq.description". Does that make a difference - if my seq.id is less than the full fasta >seq.description? Sorry for the confusion, thanks. -e

ADD REPLYlink written 5.8 years ago by User 752110
1

Yes, it matters, .id is the first word only, .description is the full title line from the > line in the FASTA file.

ADD REPLYlink written 3.9 years ago by Peter5.6k

Wow ! great thanks!

ADD REPLYlink written 7.3 years ago by Kyle60

@brentp I think the 'try/except' may be a little scary, but I find them necessary to insure that one gets a sensible warning if ever something goes wrong. Something more precise than 'go look at line 43, index is too big'. :) Your example is very beautiful, and much more pythonesque than my own, however, I wanted to avoid reading the whole file at once. That way, a very large fasta file can be read on a typical student laptop. Since I am toying with NGS data these days, I tend to avoid loading whole files :) Thanks for the very neat example!

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k

@eric, my example never holds the entire fasta file in memory. your code will give a message if the header doesn't exist in the file, but that's just one more line!

ADD REPLYlink written 7.3 years ago by brentp22k

@brentp Sorry, I thought you were using a list comprehension in the SeqIO.write function. Can I then use your suggestions to update my code and make it better? Would you rather post an answer yourself? In any case, thanks for the insight. I've been thinking all night long that there may be something I was missing in my current approach :)

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k

@eric, of course you can use the suggestions to update. your answer does the right thing, i just wanted to point out a terser approach.

ADD REPLYlink written 7.3 years ago by brentp22k

Perfect! Thanks @eric!

ADD REPLYlink written 6.2 years ago by Peixe510

Current version has a bug, rather than "if name in wanted" should be "if seq.id in wanted" or "if seq.name in wanted" - you haven't even got a variable called name.

But it should be faster to use a generator expression as in brentp's comment.

ADD REPLYlink written 6.2 years ago by Peter5.6k

Hi @Peter. You are right on both points. There was a bug (corrected now) and brentp's version is better, I think for quite a few reasons. Cheers

ADD REPLYlink written 6.2 years ago by Eric Normandeau9.6k

@Eric Thank you!!!

 

ADD REPLYlink written 3.2 years ago by ljohanam20

Whenever I try to run the script it says

 File "peptideextract.py", line 12, in <module>
    fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
NameError: name 'SeqIO' is not defined

Where is SeqIO supposed to be defined? Is it supposed to be a column within our fasta and/or text file?

Thanks!

ADD REPLYlink written 3.1 years ago by skilleta05270

Have you imported the module? Like this:

from Bio import SeqIO

p.

ADD REPLYlink written 3.1 years ago by Peixe510
<<<<<<<<<<<<<<<<<
ADD REPLYlink modified 5 months ago • written 2.9 years ago by jackiegoordial0

Use a for loop over the files found with os.listdir(...) or the glob module?

ADD REPLYlink written 2.9 years ago by Peter5.6k
11
gravatar for brentp
7.3 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

fwiw, you can do this from the commandline if you have pyfasta installed with:

$ pyfasta extract --header --fasta my.fa header1 header2 header3 subset.fa

or if the headers you want are listed in a file:

$ pyfasta extract --header --fasta my.fa --file header-list.txt > subset.fa
ADD COMMENTlink written 7.3 years ago by brentp22k
2
gravatar for Daniel Swan
7.3 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

This question I believe has already been quite well answered here

ADD COMMENTlink written 7.3 years ago by Daniel Swan13k
1

The OP is asking for a Python solution preferably, which the other questions aswers do not provide :)

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k

The OP is asking for a python solution here, not a Perl one :)

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k

no but he did say 'and also in Perl' which the linked answer did. I leave it to y u to pick up the reputation for the accepted solution :D

ADD REPLYlink written 7.3 years ago by Daniel Swan13k

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my better knowledge, corresponded to the OP's need. Cheers

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my best knowledge, corresponded to the OP's described need. Cheers

ADD REPLYlink written 7.3 years ago by Eric Normandeau9.6k
1
gravatar for Fallino
5.6 years ago by
Fallino20
Fallino20 wrote:

Here is a solution with GNU pcregrep, this is what I generally use to extract an entry (title+sequence) from a fasta file (in this example gi|116108791 and gi|30261391 in NCBI nr.fasta, you can add more patterns with -e) :

pcregrep --buffer-size=16M -M -e 'gi[[:punct:]]116108791[[:punct:]].*\s([[:alpha:]]+\s*)*' -e 'gi[[:punct:]]30261391[[:punct:]].*\s([[:alpha:]]+\s*)*' nr.fasta

--buffer-size to increase the default buffer-size which was not sufficient for NCBInr

-M for multiline matching

-e for adding a pattern

Depending of your platform, use single/double quotes. This has been tested with the Windows GNU cygwin version 8.21 2011-12-12 of PCREGREP but should work with the pcregrep binary of every Linux distribution. The main limitation is that it has to parse the entire file to return control. I tried to add the '--match-limit' parameter to the number of patterns but gave me some weird errors.

ADD COMMENTlink written 5.6 years ago by Fallino20
1
gravatar for Israel Barrantes
5.6 years ago by
Germany
Israel Barrantes670 wrote:

If you have a list of sequence IDs that you want to extract, you can use the UCSC utilities faOneRecord (to extract just a single record), or faSomeRecords (multiple sequences):

$ faSomeRecords inputfile.fasta listfile outputfile.fasta

Also, using the option '-exclude' will output sequences not present in 'listfile'

ADD COMMENTlink written 5.6 years ago by Israel Barrantes670

Hi Israel Barrantes,

I used below cmd to extract RNA seq from fasta file using ID number present in text file.

faSomeRecords nr_latest.fasta sequence.gi.txt outputfile.fasta

After above run was finished, outputfile that is outputfile.fasta has 0 byte size.

Can you please give me suggestion, what am I doing wrong.

Thanks in advance!!

I would really appreciate your help,

Thanks,

Naresh

ADD REPLYlink written 3.9 years ago by nbvasani200
0
gravatar for cpcantalapiedra
18 months ago by
Spain
cpcantalapiedra140 wrote:

If besides an identifier you had sequence length, you could create a simple BED file and use "bedtools getfasta".

An example would be:

  • Create a bed file from your whole fasta file.
  • Grep the bed lines of your identifers and pipe it to bedtools getfasta -fi whole_fasta -bed - -fo myfasta
ADD COMMENTlink written 18 months ago by cpcantalapiedra140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1519 users visited in the last hour