Question: How to delete seq_id + sequence from FASTA file matching specific seq_id
0
gravatar for ste.colombo
4.0 years ago by
European Union
ste.colombo0 wrote:

Hi, I should delete sequences from a FASTA file and I have a list of all the sequences to delete.

Since I have to remove the sequence ID and the sequence I have to delete two lanes when there is the seq_id of interest.

I tried to use grep with the -v option:

 grep -vA1 M03047:9:000000000-AD969:1:1101:16618:2205/1 file1 > file2

but It removes only the first line (the one with the seq_id): it seems that the -v option has effect only on -A and not on -A1

I tried other ways with sed instead of grep but I always get some problems, grep seems the best way but I have to figure out how to remove not only the first lane but also the second one.

 

Any ideas?

Than you

 

 

 

 

fasta file grep • 5.2k views
ADD COMMENTlink modified 4.0 years ago by samuelmiver420 • written 4.0 years ago by ste.colombo0

Thank you very much for your help. The script works well.

 

ADD REPLYlink written 4.0 years ago by ste.colombo0

You're welcome, happy to help!

ADD REPLYlink written 4.0 years ago by samuelmiver420
5
gravatar for samuelmiver
4.0 years ago by
samuelmiver420
Centre for Genomic Regulation (Barcelona, Spain)
samuelmiver420 wrote:

I add a new solution using biopython able to remove the sequence with identifiers in a given file. I consider Biopython more reliable and reproducible than grep/sed/awk when working with a great number of sequences.

As I do not know your experience with python, I will explain the solution in detail:

1) Ensure you have python and biopython installed. Type in your terminal:

python -c "import Bio"

echo $?

If "0" displayed, pass to step 2. In case error message appears, install it by easy_install or pip:

sudo easy_install -f http://biopython.org/DIST/ biopython

or

sudo pip install biopython

In case of not having pip installed, type:

sudo apt-get install pip

2) Create a script ".py" with this script ( filter.py, for example ) using your favourite text editor:

#!/usr/bin/env python

import sys
from sets import Set
from Bio import SeqIO

def list_ids():
    """
    Return a set containing the identifiers presented in a file,
    line by line, starting with ">"
    """

    # read the first file given and generate a set (faster iteration respect lists

    identifiers = Set([])

    with open(sys.argv[1], 'r') as fi:
        for line in fi:
            line = line.strip()
            identifiers.add(str(line).replace(">", ""))

    return identifiers

def filter():
    """
    Writes a file containing only the sequences with identifiers NOT
    present in a set of identifiers
    """

    identifiers = list_ids()

    with open(sys.argv[2]) as original_fasta, open(sys.argv[3], 'w') as corrected_fasta:
        records = SeqIO.parse(original_fasta, 'fasta')
        for record in records:
            print record.id
            if record.id not in identifiers:
                SeqIO.write(record, corrected_fasta, 'fasta')

if __name__ == '__main__':
    filter()

3) Run the program as follows:

chmod +x filter.py

./filter.py your_ids.txt IN.fasta OUT.fasta

where:

  • your_ids.txt --> A file containing the identifiers including ">" you want to EXCLUDE, one identifier per line; like:

>hhh 
>ghag 
>MAMND
  • IN.fasta --> your original fasta file
  • OUT.fasta --> your output file

 

 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by samuelmiver420

Thank you very much. It totally worked :)

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by ste.colombo0

Happy to hear that, you're welcome!

ADD REPLYlink written 4.0 years ago by samuelmiver420

Will this script still work with python3 ? How can I modify it so it will ?

ADD REPLYlink written 13 months ago by gabrielle.jaquemet.10

I had the same problem, I tried several alternatives in the windows subsystem for Linux, until I found your script and it has been very helpful. Thank you samuelmiver

ADD REPLYlink written 4 months ago by luistatlav0
2
gravatar for michael.ante
4.0 years ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

You can also have a look at FAST (Fast Analysis of Sequences Toolbox). There you have fasgrep:

fasgrep -v "M03047:9:000000000-AD969:1:1101:16618:2205/1" file1 > file2

It will remove the whole entry, but change the window-length. You could adjust it with the fastx toolkit:

fasta_formatter -i file2 -o file3 -w 60
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by michael.ante3.3k
1
gravatar for samuelmiver
4.0 years ago by
samuelmiver420
Centre for Genomic Regulation (Barcelona, Spain)
samuelmiver420 wrote:

This is a perfect case where awk can save you:

cat your_fasta.fasta | awk '{if (substr($0,1) == ">M03047:9:000000000-AD969:1:1101:16618:2205/1") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

I have just adapted one example found here http://seqanswers.com/forums/showthread.php?t=24443

ADD COMMENTlink written 4.0 years ago by samuelmiver420

I would ask for a second advice: I have to repeat this step (delete sequences) for 100K times or more.

Is there any way to use a file containing the list of sequences to remove instead of creating a script for everyone?

ADD REPLYlink written 4.0 years ago by ste.colombo0

Any problem with (bio)python?

ADD REPLYlink written 4.0 years ago by samuelmiver420

It would be a good idea to add this requirement to the initial message (EDIT button) in order to help and correctly inform further users about the whole problem.

ADD REPLYlink written 4.0 years ago by samuelmiver420

Simpler awk code for deleting record:

cat your_fasta.fasta | awk '/>M03047:9:000000000-AD969:1:1101:16618:2205/ {getline; while(!/>/) {getline}} 1' > fixed.fasta
ADD REPLYlink written 21 months ago by farman0
0
gravatar for ste.colombo
4.0 years ago by
European Union
ste.colombo0 wrote:

Thank you michael.ante

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by ste.colombo0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 568 users visited in the last hour