I add a new solution using biopython able to remove the sequence with identifiers in a given file. I consider Biopython more reliable and reproducible than grep/sed/awk when working with a great number of sequences.
As I do not know your experience with python, I will explain the solution in detail:
1) Ensure you have python and biopython installed. Type in your terminal:
python -c "import Bio"
echo $?
If "0" displayed, pass to step 2. In case error message appears, install it by easy_install
or pip
:
sudo easy_install -f http://biopython.org/DIST/ biopython
or
sudo pip install biopython
In case of not having pip installed, type:
sudo apt-get install pip
2) Create a script ".py" with this script ( filter.py, for example ) using your favourite text editor:
#!/usr/bin/env python
import sys
from sets import Set
from Bio import SeqIO
def list_ids():
"""
Return a set containing the identifiers presented in a file,
line by line, starting with ">"
"""
# read the first file given and generate a set (faster iteration respect lists
identifiers = Set([])
with open(sys.argv[1], 'r') as fi:
for line in fi:
line = line.strip()
identifiers.add(str(line).replace(">", ""))
return identifiers
def filter():
"""
Writes a file containing only the sequences with identifiers NOT
present in a set of identifiers
"""
identifiers = list_ids()
with open(sys.argv[2]) as original_fasta, open(sys.argv[3], 'w') as corrected_fasta:
records = SeqIO.parse(original_fasta, 'fasta')
for record in records:
print record.id
if record.id not in identifiers:
SeqIO.write(record, corrected_fasta, 'fasta')
if __name__ == '__main__':
filter()
3) Run the program as follows:
chmod +x filter.py
./filter.py your_ids.txt IN.fasta OUT.fasta
where:
your_ids.txt
--> A file containing the identifiers including >
you want to EXCLUDE, one identifier per line; like:
>hhh
>ghag
>MAMND
IN.fasta
--> your original fasta file
OUT.fasta
--> your output file
Thank you very much for your help. The script works well.
You're welcome, happy to help!
Thank you very much. It totally worked :)
Happy to hear that, you're welcome!
Will this script still work with python3 ? How can I modify it so it will ?
I had the same problem, I tried several alternatives in the windows subsystem for Linux, until I found your script and it has been very helpful. Thank you samuelmiver
Hi samuelmiver,
When I run
filter.py
, it says