Question: How To Remove Certain Sequences From A Fasta File
7
gravatar for Emmecola
6.7 years ago by
Emmecola150
Emmecola150 wrote:

Hi everyone

Is there a fast way to do this filter?

I have a huge Fasta file (sequences are short reads coming from an Illumina instrument). I have also a list of nucleotide sequences (not Fasta, just the sequences) and I want to remove from the big Fasta file all entries identical to those in the list.

My idea was simply to go down through the Fasta file and then, for every read, check all the sequences of the list. If the read matches one of the sequences then do nothing, otherwise print the read into a new file. I made this with perl but it takes ages!

The list is made up of nucleotide sequences, not IDs. It's something like this:

AACGACTACTTATCGATC

TCGGCGATATACGTAC

CCAGTTTCGGGGCTAT ....

Thanks!

fasta short parsing • 14k views
ADD COMMENTlink modified 6.7 years ago by Eric Normandeau9.5k • written 6.7 years ago by Emmecola150
1

can you make a better example... you have a file with a lot of sequences, and a list of sequence ids to remove from it. Is it correct?

ADD REPLYlink written 6.7 years ago by Giovanni M Dall'Olio25k
1

I'm fairly sure this has been asked/answered before - check the "Related" box or search the archives.

ADD REPLYlink written 6.7 years ago by Neilfws47k

Yes, except that the list is not made up of IDs. They are nucleotide sequences.

ADD REPLYlink written 6.7 years ago by Emmecola150

Do I understand correctly that you have a list of nucleotide sequences, and you want to go through a FASTA file and remove all entries that contain an exact, full-length hit to one of the nucleotide sequences in your list? Are the nucleotide sequences in your list by any chance all of the same length?

ADD REPLYlink written 6.7 years ago by Lars Juhl Jensen11k

Have a look at these two questions: http://biostar.stackexchange.com/questions/1094/visualizing-differences-between-a-lot-of-sequences http://biostar.stackexchange.com/questions/3010/how-to-remove-the-same-sequences-in-the-fasta-files ; also, make a search in the archives.

ADD REPLYlink written 6.7 years ago by Giovanni M Dall'Olio25k
12
gravatar for Pierre Lindenbaum
6.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:
  • linearize your fastq file into 3 columns: sequence, ID, qualities (using 'awk', or 'perl')
  • sort it using the sequence as the key ( unix 'sort')
  • sort your 2nd file of sequences ( unix 'sort')
  • use unix 'join' to filter out the first file and re-transform it to FASTQ using awk or perl
ADD COMMENTlink written 6.7 years ago by Pierre Lindenbaum98k

Pierre's solution is the way to go for sure, assuming you don't need to account for mismatches, INDELs, etc. Use the "-v" option in join to exclude the FASTQ sequences that match your filter set.

The beauty of this recipe is that is applies to many other scenarios and is not constrained by memory. I probably do something close to this recipe (in other situations) 100 times a year.

ADD REPLYlink written 6.7 years ago by Aaronquinlan9.9k
10
gravatar for Eric Normandeau
6.7 years ago by
Eric Normandeau9.5k
Quebec, Canada
Eric Normandeau9.5k wrote:

Hi,

EDIT: I had misunderstood that you actually had the nucleotide sequences in the exclusion file. I modified the code accordingly.

You can use the following script. Put it in a file named 'fasta_remove.py', make the file executable:

chmod +x fasta_remove.py

and use it by typing:

./fasta_remove.py <input_fasta> <remove_file> <output_fasta>

The 'removefile' file contains one sequence name per line that needs to be removed from the 'inputfasta' file. The input fasta file is not limited in size. You will need to have Biopython installed and this works under Linux, obviously.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nuc = seq.seq.tostring()
        if nuc not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

Cheers!

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Eric Normandeau9.5k

Hi

I am working with old sanger reads, and I have the fasta and qual files separated. I will like to remove some sequences from the qual file using a list of names of those sequences but I am having problems with these. ¿Do yo have a solution for this?

Thankyou very much

ADD REPLYlink written 2.2 years ago by shinken12310

ask this as a new question please.

ADD REPLYlink written 2.2 years ago by Pierre Lindenbaum98k

For me, I had to change the code to this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        **nam = seq.id
        **nuc = str(seq.seq)
        **if nam not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

(I changed the lines beginning with '**' )

Versions: Python 2.7.11 and Biopython 1.67

ADD REPLYlink modified 14 months ago • written 14 months ago by teresalaguna0
1
gravatar for Grünschnabel
6.7 years ago by
Grünschnabel10 wrote:

An alternative to Pierres suggestion: use

grep -v SeqToBeDeleted inputfile > outputfile

and interate on all your Sequences to be removed.

ADD COMMENTlink written 6.7 years ago by Grünschnabel10
4

Unless the number of sequences that you want to remove is very small, this will be prohibitively slow. In any case, if you want to use grep for it, here are two simple optimizations: 1) use the -f option to give grep a file with all sequences that you want to search with instead of running grep multiple times, and 2) use fgrep instead of grep since you only need to do exact string matching and not full regular expression matching.

ADD REPLYlink written 6.7 years ago by Lars Juhl Jensen11k
2

Plus it will only remove the sequence part of the FASTQ entry and will result in a mangled file consisting of proper FASTQ entries that should remain and incomplete (i.e. head, separator, qual, yet no sequence) entries for the matched sequences.

ADD REPLYlink written 6.7 years ago by Aaronquinlan9.9k

Unless the number of sequences that you want to remove is very small, this will be prohibitively slow. In either case, if you want to use grep for it, here are two simple optimations: 1) use the -f option to give grep a file with all sequences that you want to search with instead of running grep multiple times, and 2) use fgrep instead of grep since you only need to do exact string matching and not full regular expression matching.

ADD REPLYlink written 6.7 years ago by Lars Juhl Jensen11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1261 users visited in the last hour