Extract A Group Of Fasta Sequences From A File
6
6
Entering edit mode
14.2 years ago
Kyle ▴ 60

I have a group of fasta sequences numbered sequentially 1,2, etc. and would like to extract a subset by their numbers/identifiers. Using bp_index and bp_fetch in bioperl enables me to extract a single sequences, but I am not familiar enough with perl to automate this for a large group. Any suggestions, ideally in python but also in perl? Thanks.

fasta python perl biopython bioperl • 44k views
ADD COMMENT
18
Entering edit mode
14.2 years ago

Hi,

Here is a quick solution in Python.

from Bio import SeqIO

fasta_file = "fasta_file.fasta" # Input fasta file
wanted_file = "wanted_file.txt" # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

Cheers!

ADD COMMENT
3
Entering edit mode

i think that's a poor representation of how easy it is to do this in biopython (sorry eric, but ...). how about something like this:

http://gist.github.com/477969

looks less like java, uses less code, and is more flexible.

ADD REPLY
1
Entering edit mode

@eric, what does (number_file) stands for, in your code? Just because I don't understand it. Is it the wanted_file? Thank u!

ADD REPLY
1
Entering edit mode

Hi @peixe, you had guessed right. I changed it to 'wanted_file'. Cheers

ADD REPLY
1
Entering edit mode

Just found this response (after posting on 2827),but the results file does not actually get anything written to it. One main issue I can think of is I am just trying to compare "seq.id" and not the full "seq.description". Does that make a difference - if my seq.id is less than the full fasta >seq.description? Sorry for the confusion, thanks. -e

ADD REPLY
1
Entering edit mode

Yes, it matters, .id is the first word only, .description is the full title line from the > line in the FASTA file.

ADD REPLY
0
Entering edit mode

Wow ! great thanks!

ADD REPLY
0
Entering edit mode

@brentp I think the 'try/except' may be a little scary, but I find them necessary to insure that one gets a sensible warning if ever something goes wrong. Something more precise than 'go look at line 43, index is too big'. :) Your example is very beautiful, and much more pythonesque than my own, however, I wanted to avoid reading the whole file at once. That way, a very large fasta file can be read on a typical student laptop. Since I am toying with NGS data these days, I tend to avoid loading whole files :) Thanks for the very neat example!

ADD REPLY
0
Entering edit mode

@eric, my example never holds the entire fasta file in memory. your code will give a message if the header doesn't exist in the file, but that's just one more line!

ADD REPLY
0
Entering edit mode

@brentp Sorry, I thought you were using a list comprehension in the SeqIO.write function. Can I then use your suggestions to update my code and make it better? Would you rather post an answer yourself? In any case, thanks for the insight. I've been thinking all night long that there may be something I was missing in my current approach :)

ADD REPLY
0
Entering edit mode

@eric, of course you can use the suggestions to update. your answer does the right thing, i just wanted to point out a terser approach.

ADD REPLY
0
Entering edit mode

Perfect! Thanks @eric!

ADD REPLY
0
Entering edit mode

Current version has a bug, rather than "if name in wanted" should be "if seq.id in wanted" or "if seq.name in wanted" - you haven't even got a variable called name.

But it should be faster to use a generator expression as in brentp's comment.

ADD REPLY
0
Entering edit mode

Hi @Peter. You are right on both points. There was a bug (corrected now) and brentp's version is better, I think for quite a few reasons. Cheers

ADD REPLY
0
Entering edit mode

@Eric Thank you!!!

ADD REPLY
0
Entering edit mode

Whenever I try to run the script it says

 File "peptideextract.py", line 12, in <module>
    fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
NameError: name 'SeqIO' is not defined

Where is SeqIO supposed to be defined? Is it supposed to be a column within our fasta and/or text file?

Thanks!

ADD REPLY
0
Entering edit mode

Have you imported the module? Like this:

from Bio import SeqIO

p.

ADD REPLY
0
Entering edit mode
<<<<<<<<<<<<<<<<<
ADD REPLY
0
Entering edit mode

Use a for loop over the files found with os.listdir(...) or the glob module?

ADD REPLY
0
Entering edit mode

Hello I have troubles running this script. I have an ID's file with exactly the same header as the fasta file, the results file is generated but is always empty. What am I doing wrong?

Thanks

ADD REPLY
0
Entering edit mode

Thanks very much for this! I have made a few modifications to Eric's script so it can be easily run on different files. I have it saved as 'seq_extractor.py' on my computer:

# Execute by: python seq_extractor.py readsList fastafile outputfile

from Bio import SeqIO
import sys

readsList = open(sys.argv[1], 'rU')
fastafile = sys.argv[2]
outputfile = open(sys.argv[3], 'w')

wanted = set()
with readsList as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fastafile),'fasta')

with outputfile as i:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], i, "fasta")

readsList.close()
outputfile.close()

Hope that helps.

ADD REPLY
12
Entering edit mode
14.2 years ago
brentp 24k

FWIW, you can do this from the commandline if you have pyfasta installed with:

pyfasta extract --header --fasta my.fa header1 header2 header3 subset.fa

or if the headers you want are listed in a file:

pyfasta extract --header --fasta my.fa --file header-list.txt > subset.fa
ADD COMMENT
0
Entering edit mode

I tried this I get Error

Traceback (most recent call last):
  File "/usr/local/bin/pyfasta", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 38, in main
    globals()[action](sys.argv[2:])
  File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 134, in extract
    seq = f[seqname]
  File "/usr/local/lib/python3.8/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '>MITEs101'
ADD REPLY
2
Entering edit mode
14.2 years ago
User 59 13k

This question I believe has already been quite well answered here.

ADD COMMENT
1
Entering edit mode

The OP is asking for a Python solution preferably, which the other questions aswers do not provide :)

ADD REPLY
0
Entering edit mode

The OP is asking for a python solution here, not a Perl one :)

ADD REPLY
0
Entering edit mode

no but he did say 'and also in Perl' which the linked answer did. I leave it to y u to pick up the reputation for the accepted solution :D

ADD REPLY
0
Entering edit mode

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my better knowledge, corresponded to the OP's need. Cheers

ADD REPLY
0
Entering edit mode

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my best knowledge, corresponded to the OP's described need. Cheers

ADD REPLY
1
Entering edit mode
12.5 years ago
Fallino ▴ 20

Here is a solution with GNU pcregrep, this is what I generally use to extract an entry (title+sequence) from a fasta file (in this example gi|116108791 and gi|30261391 in NCBI nr.fasta, you can add more patterns with -e) :

pcregrep --buffer-size=16M -M -e 'gi[[:punct:]]116108791[[:punct:]].*\s([[:alpha:]]+\s*)*' -e 'gi[[:punct:]]30261391[[:punct:]].*\s([[:alpha:]]+\s*)*' nr.fasta

--buffer-size to increase the default buffer-size which was not sufficient for NCBInr

-M for multiline matching

-e for adding a pattern

Depending of your platform, use single/double quotes. This has been tested with the Windows GNU cygwin version 8.21 2011-12-12 of PCREGREP but should work with the pcregrep binary of every Linux distribution. The main limitation is that it has to parse the entire file to return control. I tried to add the '--match-limit' parameter to the number of patterns but gave me some weird errors.

ADD COMMENT
1
Entering edit mode
12.5 years ago

If you have a list of sequence IDs that you want to extract, you can use the UCSC utilities faOneRecord (to extract just a single record), or faSomeRecords (multiple sequences):

$ faSomeRecords inputfile.fasta listfile outputfile.fasta

Also, using the option '-exclude' will output sequences not present in 'listfile'

ADD COMMENT
0
Entering edit mode

Hi Israel Barrantes,

I used below cmd to extract RNA seq from fasta file using ID number present in text file.

faSomeRecords nr_latest.fasta sequence.gi.txt outputfile.fasta

After above run was finished, outputfile that is outputfile.fasta has 0 byte size.

Can you please give me suggestion, what am I doing wrong.

Thanks in advance!!

I would really appreciate your help,

Thanks,

Naresh

ADD REPLY
0
Entering edit mode
8.4 years ago

If besides an identifier you had sequence length, you could create a simple BED file and use "bedtools getfasta".

An example would be:

  • Create a bed file from your whole fasta file.
  • Grep the bed lines of your identifers and pipe it to bedtools getfasta -fi whole_fasta -bed - -fo myfasta
ADD COMMENT

Login before adding your answer.

Traffic: 1414 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6