Question: Filtering sequences from a Fasta file using Biopython
2
gravatar for tpaisie
2.4 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

Hi everyone, I'm pretty sure this would be a super simple python script using Biopython, but could someone point me in the right direction? I have a Fasta file with 229 sequences, they are named by only their genbank accession number. I have a text file with the genbank accession numbers of the sequences I want to extract from the original fasta file with the 229 sequences. I would like to be able to filter out the sequences listed in my text file to create a new fasta file, with only my sequences of interest.

I feel like I accidentally did this once when I was trying to write a script for another purpose. I also am trying to actively learn biopython, so just a push in the right direction would help a lot!! Thanks!

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by tpaisie70
1
gravatar for mforde84
2.4 years ago by
mforde841.2k
mforde841.2k wrote:

You're making this way too hard on yourself. Bash is pretty awesome if you give it a try. :)

$ head lookup.txt;
GI123
GI987
GI3873
$ while read -r line; do
     sed -n -e "/>$line/,/>/ p" sequence.fasta | head -n-1 >> newfasta.txt;
  done < "lookup.txt";
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by mforde841.2k
1

Definitely agree with the usefulness of bash for simple tasks, but in the spirit of learning biopython for more complex tasks I understand OP tries Biopython for this job as well.

(And I think I don't need a lot more lines in Python to do the same job ;-)
Constructs like "/>$line/,/>/ p" scare me. )

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k
1
gravatar for jrj.healey
2.4 years ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

Wrote a multipurpose fasta-fetcher a while back here: https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It supports finding the right fasta by a file of keys, by string directly, and also has an 'invert' parameter to retrieve everything but the ones you list. Usage I think is pretty self explanatory (script has a help function thanks to argparse) but feel free to ask for clarification:

python fastafetcher.py -k keyfile.txt -f fastafile.fasta [--invert]

key file has to be a simple list of exact strings, one per line.

# Extract fasta files by their descriptors stored in a separate file.
# Requires biopython

import StringIO
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse

def getKeys(keyFile):
    """Turns the input key file into a tuple. May be memory intensive."""

    with open(keyFile, "r") as kfh:
        keys = []
            for line in kfh:
            line = line.rstrip('\n')
            line = line.lstrip('>')
            keys.append(line)

    return keys

def main():
    """Takes a list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
    # Parse arguments from the commandline:
    try:
        parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
        parser.add_argument(
            '-f',
            '--fasta',
            action='store',
            required=True,
            help='The multifasta to search.')
        parser.add_argument(
            '-k',
            '--keys',
            action='store',
            required=True,
            help='A file of header strings to search the multifasta for. Must be exact. Must be one per line.')
        parser.add_argument(
            '-o',
            '--outfile',
            action='store',
            default=None,
            help='Output file to store the new fasta sequences in. Just prints to screen by default.')
        parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
        parser.add_argument(
            '-i',
            '--invert',
            action='store_true',
            help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
        args = parser.parse_args()

    except:
        print "An exception occured with argument parsing. Check your provided options."
        traceback.print_exc()

    # Rename args to enable them to be provided to the getKeys function:
    keyFile = args.keys
    inFile = args.fasta
    outFile = args.outfile
# Main code:
    # Call getKeys() to create the tuple of keys from the provided file:
    try:
        keys = getKeys(keyFile)
    except IOError:
        keys = args.keys


    if args.verbose is not False:
        if args.invert is False:
            print('Fetching the following keys from: ' + inFile)
            for key in keys:
                print(key)
        else:
            print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
            for key in keys:
                print(key)

    # Parse in the multifasta and assign an iterable variable:
        seqIter = SeqIO.parse(inFile, 'fasta')

    # For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
    for seq in seqIter:
        if args.invert is False:
            if seq.id in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")
        else:
    # If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
            if seq.id not in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")

if __name__ == "__main__":
    main()
ADD COMMENTlink written 2.4 years ago by jrj.healey12k
1

Looks pretty, nicely commented too.
But don't you mean "list" where you write tuple? That's not the same structure :p Also, I would rewrite getKeys() as below:

def getKeys(keyFile):
    """Turns the input key file into a list. May be memory intensive."""
    with open(keyFile, "r") as kfh:
        return([line.lstrip('>') for line in kfh if not line = ""])
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by WouterDeCoster39k
1

Oh very probably :P I'm still fairly novice when it comes to python scripting so I tend to write things in an overly vebose way just partly because thats how I think about the problem, but also because I usually give the scripts to others in my lab who are even less well versed in it than I am so I figure it makes it easier to follow :P comments and criticisms are always welcome for my own learning anyway :)

ADD REPLYlink written 2.4 years ago by jrj.healey12k

There is for sure nothing wrong with writing verbose code, definitely a plus for readability. I think my version would be a bit faster, but for most (small) lists the difference is probably neglectable.

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k
0
gravatar for WouterDeCoster
2.4 years ago by
Belgium
WouterDeCoster39k wrote:

I assume you went through the Biopython Tutorial and Cookbook?

This is indeed pretty straightforward. Read the text file with accession numbers in a list (desiredList), iterate over your fasta file and if the record.id attribute is in the desiredList, add it to an output list. Then after looping, write the desired sequences to an output file. Need more pointers?

ADD COMMENTlink written 2.4 years ago by WouterDeCoster39k

so biopython has this as their filtering example:

from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

so when i modify this example to my files it gives me only one sequence in my output fasta file, I guess i'm confused why its doing that.

ADD REPLYlink modified 2.4 years ago by WouterDeCoster39k • written 2.4 years ago by tpaisie70

I assume you adapted the code to work for fasta format? Could you show your code as well? This seems right...

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k
from Bio import SeqIO
input_file = "cov_uniq_aln_103016.fasta"
id_file = "CoV_seq_names_010517.txt"
output_file = "cov_less_seqs_010517.fasta"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

Maybe i'm missing something in the code for the fasta format? it does output a fasta file, just only with one sequence instead of the 80 or so sequence ids in the text file.

ADD REPLYlink modified 2.4 years ago by WouterDeCoster39k • written 2.4 years ago by tpaisie70

And the wanted variable looks okay? It contains the expected number of identifiers (and the expected values)?

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k

This is my output from the terminal:

Found 1 unique identifiers in CoV_seq_names_010517.txt Saved 1 records from cov_uniq_aln_103016.fasta to cov_less_seqs_010517.fasta

And the output fasta file has only the first sequences on my id_file, so I think the problem is this line:

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by tpaisie70

Indeed, only one identifier was found. You will need to modify the line you mentioned. How does your id_file look like? How is it separated?

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k
0
gravatar for tpaisie
2.4 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

So I figured it out! The original script works, I saved my id_file has a tab delimited file in excel and that was the problem! You guys are really motivating me to work on my python!

ADD COMMENTlink written 2.4 years ago by tpaisie70

Good to hear! You were pretty close to the solution yourself, too.

ADD REPLYlink written 2.4 years ago by WouterDeCoster39k

Excel strikes again!

ADD REPLYlink written 2.4 years ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1014 users visited in the last hour