How to delete seq_id + sequence from FASTA file matching specific seq_id
3
0
Entering edit mode
7.1 years ago

Hi, I should delete sequences from a FASTA file and I have a list of all the sequences to delete.

Since I have to remove the sequence ID and the sequence I have to delete two lanes when there is the seq_id of interest.

I tried to use grep with the -v option:

grep -vA1 M03047:9:000000000-AD969:1:1101:16618:2205/1 file1 > file2


but it removes only the first line (the one with the seq_id): it seems that the -v option has effect only on -A and not on -A1

I tried other ways with sed instead of grep but I always get some problems, grep seems the best way but I have to figure out how to remove not only the first lane but also the second one.

Any ideas?

Thank you

grep FASTA • 10k views
5
Entering edit mode
7.1 years ago
samuelmiver ▴ 440

I add a new solution using biopython able to remove the sequence with identifiers in a given file. I consider Biopython more reliable and reproducible than grep/sed/awk when working with a great number of sequences.

As I do not know your experience with python, I will explain the solution in detail:

1) Ensure you have python and biopython installed. Type in your terminal:

python -c "import Bio"
echo $?  If "0" displayed, pass to step 2. In case error message appears, install it by easy_install or pip: sudo easy_install -f http://biopython.org/DIST/ biopython  or sudo pip install biopython  In case of not having pip installed, type: sudo apt-get install pip  2) Create a script ".py" with this script ( filter.py, for example ) using your favourite text editor: #!/usr/bin/env python import sys from sets import Set from Bio import SeqIO def list_ids(): """ Return a set containing the identifiers presented in a file, line by line, starting with ">" """ # read the first file given and generate a set (faster iteration respect lists identifiers = Set([]) with open(sys.argv[1], 'r') as fi: for line in fi: line = line.strip() identifiers.add(str(line).replace(">", "")) return identifiers def filter(): """ Writes a file containing only the sequences with identifiers NOT present in a set of identifiers """ identifiers = list_ids() with open(sys.argv[2]) as original_fasta, open(sys.argv[3], 'w') as corrected_fasta: records = SeqIO.parse(original_fasta, 'fasta') for record in records: print record.id if record.id not in identifiers: SeqIO.write(record, corrected_fasta, 'fasta') if __name__ == '__main__': filter()  3) Run the program as follows: chmod +x filter.py ./filter.py your_ids.txt IN.fasta OUT.fasta  where: • your_ids.txt --> A file containing the identifiers including > you want to EXCLUDE, one identifier per line; like: >hhh >ghag >MAMND  • IN.fasta --> your original fasta file • OUT.fasta --> your output file ADD COMMENT 0 Entering edit mode Thank you very much for your help. The script works well. ADD REPLY 0 Entering edit mode You're welcome, happy to help! ADD REPLY 0 Entering edit mode Thank you very much. It totally worked :) ADD REPLY 0 Entering edit mode Happy to hear that, you're welcome! ADD REPLY 0 Entering edit mode Will this script still work with python3 ? How can I modify it so it will ? ADD REPLY 0 Entering edit mode I had the same problem, I tried several alternatives in the windows subsystem for Linux, until I found your script and it has been very helpful. Thank you samuelmiver ADD REPLY 0 Entering edit mode Hi samuelmiver, When I run filter.py, it says  File "/home/zwl/test/./filter.py", line 35 print record.id ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print(record.id)?  ADD REPLY 2 Entering edit mode 7.1 years ago michael.ante ★ 3.7k You can also have a look at FAST (Fast Analysis of Sequences Toolbox). There you have fasgrep: fasgrep -v "M03047:9:000000000-AD969:1:1101:16618:2205/1" file1 > file2  It will remove the whole entry, but change the window-length. You could adjust it with the fastx toolkit: fasta_formatter -i file2 -o file3 -w 60  ADD COMMENT 0 Entering edit mode Thank you michael.ante ADD REPLY 1 Entering edit mode 7.1 years ago samuelmiver ▴ 440 This is a perfect case where awk can save you: cat your_fasta.fasta | awk '{if (substr($0,1) == ">M03047:9:000000000-AD969:1:1101:16618:2205/1") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print$0}' > fixed.fasta


I have just adapted one example found here

0
Entering edit mode

I would ask for a second advice: I have to repeat this step (delete sequences) for 100K times or more.

Is there any way to use a file containing the list of sequences to remove instead of creating a script for everyone?

0
Entering edit mode

Any problem with (bio)python?

0
Entering edit mode

It would be a good idea to add this requirement to the initial message (EDIT button) in order to help and correctly inform further users about the whole problem.

0
Entering edit mode

Simpler awk code for deleting record:

cat your_fasta.fasta | awk '/>M03047:9:000000000-AD969:1:1101:16618:2205/ {getline; while(!/>/) {getline}} 1' > fixed.fasta