Question

How to delete seq_id + sequence from FASTA file matching specific seq_id

0

Entering edit mode

9.2 years ago

ste.colombo • 0

Hi, I should delete sequences from a FASTA file and I have a list of all the sequences to delete.

Since I have to remove the sequence ID and the sequence I have to delete two lanes when there is the seq_id of interest.

I tried to use grep with the -v option:

grep -vA1 M03047:9:000000000-AD969:1:1101:16618:2205/1 file1 > file2

but it removes only the first line (the one with the seq_id): it seems that the -v option has effect only on -A and not on -A1

I tried other ways with sed instead of grep but I always get some problems, grep seems the best way but I have to figure out how to remove not only the first lane but also the second one.

Any ideas?

Thank you

grep FASTA • 13k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by ste.colombo • 0

Ram · Answer 1 · 2015-09-14

5

Entering edit mode

9.2 years ago

samuelmiver ▴ 440

I add a new solution using biopython able to remove the sequence with identifiers in a given file. I consider Biopython more reliable and reproducible than grep/sed/awk when working with a great number of sequences.

As I do not know your experience with python, I will explain the solution in detail:

1) Ensure you have python and biopython installed. Type in your terminal:

python -c "import Bio"
echo $?

If "0" displayed, pass to step 2. In case error message appears, install it by easy_install or pip:

sudo easy_install -f http://biopython.org/DIST/ biopython

or

sudo pip install biopython

In case of not having pip installed, type:

sudo apt-get install pip

2) Create a script ".py" with this script ( filter.py, for example ) using your favourite text editor:

#!/usr/bin/env python

import sys
from sets import Set
from Bio import SeqIO

def list_ids():
    """
    Return a set containing the identifiers presented in a file,
    line by line, starting with ">"
    """

    # read the first file given and generate a set (faster iteration respect lists

    identifiers = Set([])

    with open(sys.argv[1], 'r') as fi:
        for line in fi:
            line = line.strip()
            identifiers.add(str(line).replace(">", ""))

    return identifiers

def filter():
    """
    Writes a file containing only the sequences with identifiers NOT
    present in a set of identifiers
    """

    identifiers = list_ids()

    with open(sys.argv[2]) as original_fasta, open(sys.argv[3], 'w') as corrected_fasta:
        records = SeqIO.parse(original_fasta, 'fasta')
        for record in records:
            print record.id
            if record.id not in identifiers:
                SeqIO.write(record, corrected_fasta, 'fasta')

if __name__ == '__main__':
    filter()

3) Run the program as follows:

chmod +x filter.py
./filter.py your_ids.txt IN.fasta OUT.fasta

where:

your_ids.txt --> A file containing the identifiers including > you want to EXCLUDE, one identifier per line; like:
```
>hhh 
>ghag 
>MAMND
```
IN.fasta --> your original fasta file
OUT.fasta --> your output file

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

Thank you very much for your help. The script works well.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by ste.colombo • 0

0

Entering edit mode

You're welcome, happy to help!

ADD REPLY • link 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

Thank you very much. It totally worked :)

ADD REPLY • link 9.2 years ago by ste.colombo • 0

0

Entering edit mode

Happy to hear that, you're welcome!

ADD REPLY • link 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

Will this script still work with python3 ? How can I modify it so it will ?

ADD REPLY • link 6.3 years ago by gabrielle.jaquemet.1 • 0

0

Entering edit mode

I had the same problem, I tried several alternatives in the windows subsystem for Linux, until I found your script and it has been very helpful. Thank you samuelmiver

ADD REPLY • link 5.5 years ago by luistatlav • 0

0

Entering edit mode

Hi samuelmiver,

When I run filter.py, it says

  File "/home/zwl/test/./filter.py", line 35
    print record.id
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(record.id)?

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 3.3 years ago by zinwinlatt14 • 0

Ram · Answer 2 · 2015-09-14

2

Entering edit mode

9.2 years ago

michael.ante ★ 3.9k

You can also have a look at FAST (Fast Analysis of Sequences Toolbox). There you have fasgrep:

fasgrep -v "M03047:9:000000000-AD969:1:1101:16618:2205/1" file1 > file2

It will remove the whole entry, but change the window-length. You could adjust it with the fastx toolkit:

fasta_formatter -i file2 -o file3 -w 60

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Thank you michael.ante

ADD REPLY • link 9.2 years ago by ste.colombo • 0

Ram · Answer 3 · 2015-09-14

1

Entering edit mode

9.2 years ago

samuelmiver ▴ 440

This is a perfect case where awk can save you:

cat your_fasta.fasta | awk '{if (substr($0,1) == ">M03047:9:000000000-AD969:1:1101:16618:2205/1") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

I have just adapted one example found here

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

I would ask for a second advice: I have to repeat this step (delete sequences) for 100K times or more.

Is there any way to use a file containing the list of sequences to remove instead of creating a script for everyone?

ADD REPLY • link 9.2 years ago by ste.colombo • 0

0

Entering edit mode

Any problem with (bio)python?

ADD REPLY • link 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

It would be a good idea to add this requirement to the initial message (EDIT button) in order to help and correctly inform further users about the whole problem.

ADD REPLY • link 9.2 years ago by samuelmiver ▴ 440

0

Entering edit mode

Simpler awk code for deleting record:

cat your_fasta.fasta | awk '/>M03047:9:000000000-AD969:1:1101:16618:2205/ {getline; while(!/>/) {getline}} 1' > fixed.fasta

ADD REPLY • link 6.9 years ago by farman • 0