How to remove sequences from a fasta file based on ID list?
7.0 years ago
jan • 0

Hi all, I am looking for a way to remove some sequences from a FASTA file (containing aligned protein sequences) based on a list of headers given as a separate input file. I am aware that there are similar questions here, but in most cases the answers show how to extract the listed sequences, not how to remove them. I also tried the suggested software (e.g., BBMap), but it either destroyed my alignment or replaced "-" with "N" (or both), which is particularly bad for protein alignments. Any idea how to do this? A bash, Python, or Perl script would be great. Many thanks in advance!

### fasta file ###

>Species_X
PFEAIQIINLPHRYGANTFKLHRLPVPRPGQVLGLVGTNGIGKSTALKILAGKLKPNLGR
FTSPPDWQEILTHFRGSELQNYFTRILEDNLKAIIKPQYVDHIPLSGGELQRFAIAVVAI
QNAEIYMFDEPSSYLDVKQRLKAAQVVRSYVIVVEHDLSVLDYLSDFICCLYGKPGAYGV
VTLPFSVREGINIFLAGFVPTENLRFRDESLTFKGEFTDSQIIVMLGENGTGKTTFIRML
AGLLNVSYKPQKISPKFQNSVRHLLHQKIRDSYMHPQFMSDVMKPLQIEQLMDQEVVNLS
GGELQRVALTLCLGKPADIYLIDEPSAYLDSEQRIVASKVIKRFILHAKKTAFVVEHDFI
MATYLADRVIVYEGQPSIDCTANCPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLEST
KDREQKSAGSYY
>Species_Y
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------MLGENGTGKTTFIRML
AG--NVSYKPQ--------TVRQLLHDKIRDAYTHPQFVSDVIRPLQIEQLLDQVVKTLS
GGEKQRVAITLCLGKPADIYLIDEPSAHLDSEQRITASKVIKRFILHAKKTAFIVEHDFI
MATYLADRVIVYEGQPAVKCIAHSPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLESI
KDKEQKTAGSYY
>Species_Z
PFGAIHIINLPHRYSANSFKLHRLPMPRPGQVLGLVGTNGIGKSTALKILSGKLKPNLGR
FDNPPDWEEILKYFRGSELQNYFTKVLEDDLKAVVKPQYVDQIPLSGGELQRFAIGLVCV
QKADVYMFDEPSSYLDVKQRLAAARSIREYVIVVEHDLSVLDYLSDFVCVLYGRPALYGV
VTLPASVREGINIFLDGHIPTENLRFREESLTFRGSFTDSEIIVMMGENGTGKTTFCKML
AGAENISMKPQKITPKFQGTVRQLFFKRIKAAFLSPQFQTDVYKPLKIDDFIDQEVQNLS
GGELQRVAIVLALGIPADIYLIDEPSAYLDSEQRIVASRVIKRFIMHTKKTAFIVEHDFI
MATYLADRVIVFDGQPSVDAHANAPESLVTGCNTFLKNLDVTFRRDPNSYRPRINKYQSQ
MDQEQKLAGNY-

### to_remove.txt ###

Species_X
Species_Z

### desired output ###

>Species_Y
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------MLGENGTGKTTFIRML
AG--NVSYKPQ--------TVRQLLHDKIRDAYTHPQFVSDVIRPLQIEQLLDQVVKTLS
GGEKQRVAITLCLGKPADIYLIDEPSAHLDSEQRITASKVIKRFILHAKKTAFIVEHDFI
MATYLADRVIVYEGQPAVKCIAHSPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLESI
KDKEQKTAGSYY

Have you already searched this site for a solution? This has been asked, and answered, a zillion times already.

7.0 years ago
GenoMax 141k

Get the faSomeRecords utility by Jim Kent (one of the UCSC Kent command-line utilities). Add execute permissions if needed (chmod u+x faSomeRecords).

Run it like this: ./faSomeRecords -exclude your_original.fa id_to_exclude final.fa

List the IDs to exclude in the id_to_exclude file, one per line.
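
If the binary cannot be run on your system, the same exclusion can be done with a short Python script. This is only a minimal sketch, not how faSomeRecords itself works: the function names (read_ids, filter_fasta) and the script name in the usage comment are made up for illustration. It assumes that each ID in the list matches the first whitespace-delimited word of a header, and it copies sequence lines through unchanged, so gap characters ("-") in an alignment are left alone.

```python
import sys


def read_ids(path):
    """Return the set of IDs listed one per line in `path` (blank lines ignored).

    A leading '>' on an ID is tolerated and stripped.
    """
    with open(path) as handle:
        return {line.strip().lstrip(">") for line in handle if line.strip()}


def filter_fasta(lines, ids_to_remove):
    """Yield the lines of every record whose ID is not in `ids_to_remove`.

    Assumption: the record ID is the first whitespace-delimited word after '>'.
    Sequence lines are yielded exactly as read, so '-' gaps are preserved.
    Any lines appearing before the first header are dropped.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id not in ids_to_remove
        if keep:
            yield line


# Hypothetical usage:
#   python remove_fasta_records.py your_original.fa id_to_exclude > final.fa
if __name__ == "__main__" and len(sys.argv) == 3:
    ids = read_ids(sys.argv[2])
    with open(sys.argv[1]) as fasta:
        sys.stdout.writelines(filter_fasta(fasta, ids))
```

With the example from the question, passing the alignment and to_remove.txt should leave only the Species_Y record, with its line wrapping and gaps untouched.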


Thank you! This works well on Linux (couldn't make it run on OSX before and therefore hoped to find a raw/executable script here).


Here is a link to the macOS version of the same program, in case someone finds this thread in the future. The rest of the instructions are the same.
