Question

How to remove sequences from a fasta file based on ID list?

0

7.0 years ago

jan • 0

Hi all, I am looking for a way to remove some sequences from a fasta file (containing aligned protein sequences) based on a list of headers given in a separate input file. I am aware that there are similar questions here, but in most cases the answers show how to extract the listed sequences, not how to remove them. I also tried the suggested software (e.g., BBMap), but it either destroyed my alignment or replaced "-" with "N" (or both), which is particularly bad for protein alignments. Any idea how to do this? A bash, python, or perl script would be great. Many thanks in advance!

### fasta file ###

>Species_X
PFEAIQIINLPHRYGANTFKLHRLPVPRPGQVLGLVGTNGIGKSTALKILAGKLKPNLGR
FTSPPDWQEILTHFRGSELQNYFTRILEDNLKAIIKPQYVDHIPLSGGELQRFAIAVVAI
QNAEIYMFDEPSSYLDVKQRLKAAQVVRSYVIVVEHDLSVLDYLSDFICCLYGKPGAYGV
VTLPFSVREGINIFLAGFVPTENLRFRDESLTFKGEFTDSQIIVMLGENGTGKTTFIRML
AGLLNVSYKPQKISPKFQNSVRHLLHQKIRDSYMHPQFMSDVMKPLQIEQLMDQEVVNLS
GGELQRVALTLCLGKPADIYLIDEPSAYLDSEQRIVASKVIKRFILHAKKTAFVVEHDFI
MATYLADRVIVYEGQPSIDCTANCPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLEST
KDREQKSAGSYY
>Species_Y
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------MLGENGTGKTTFIRML
AG--NVSYKPQ--------TVRQLLHDKIRDAYTHPQFVSDVIRPLQIEQLLDQVVKTLS
GGEKQRVAITLCLGKPADIYLIDEPSAHLDSEQRITASKVIKRFILHAKKTAFIVEHDFI
MATYLADRVIVYEGQPAVKCIAHSPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLESI
KDKEQKTAGSYY
>Species_Z
PFGAIHIINLPHRYSANSFKLHRLPMPRPGQVLGLVGTNGIGKSTALKILSGKLKPNLGR
FDNPPDWEEILKYFRGSELQNYFTKVLEDDLKAVVKPQYVDQIPLSGGELQRFAIGLVCV
QKADVYMFDEPSSYLDVKQRLAAARSIREYVIVVEHDLSVLDYLSDFVCVLYGRPALYGV
VTLPASVREGINIFLDGHIPTENLRFREESLTFRGSFTDSEIIVMMGENGTGKTTFCKML
AGAENISMKPQKITPKFQGTVRQLFFKRIKAAFLSPQFQTDVYKPLKIDDFIDQEVQNLS
GGELQRVAIVLALGIPADIYLIDEPSAYLDSEQRIVASRVIKRFIMHTKKTAFIVEHDFI
MATYLADRVIVFDGQPSVDAHANAPESLVTGCNTFLKNLDVTFRRDPNSYRPRINKYQSQ
MDQEQKLAGNY-

### to_remove.txt ###

Species_X
Species_Z

### desired output ###

>Species_Y
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------MLGENGTGKTTFIRML
AG--NVSYKPQ--------TVRQLLHDKIRDAYTHPQFVSDVIRPLQIEQLLDQVVKTLS
GGEKQRVAITLCLGKPADIYLIDEPSAHLDSEQRITASKVIKRFILHAKKTAFIVEHDFI
MATYLADRVIVYEGQPAVKCIAHSPQSLLSGMNLFLSHLNITFRRDPTNFRPRINKLESI
KDKEQKTAGSYY

sequence alignment • 6.0k views

1

Did you already search this site for a solution yourself? This has been asked a zillion times already, and answered as well.

ADD REPLY • link 7.0 years ago by Benn 8.3k

score 2 · Answer 1 · 2017-04-14

2

7.0 years ago

GenoMax 141k

Get the faSomeRecords utility from Jim Kent. Add execute permissions if needed (chmod u+x faSomeRecords).

Run it like this: ./faSomeRecords -exclude your_original.fa id_to_exclude final.fa

List the IDs to exclude in the id_to_exclude file, one per line.
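For anyone who cannot run the binary, the same exclusion logic can be sketched in plain Python. This is only a minimal sketch, not how faSomeRecords itself is implemented. It assumes the record ID is the first whitespace-delimited word after ">" and that the ID file has one ID per line; the function names and the script name in the usage line are made up for illustration. Sequence lines are copied verbatim, so gap characters ("-") and line wrapping are left untouched.

```python
import sys


def load_ids(path):
    """Read one ID per line, skipping blank lines and any leading '>'."""
    with open(path) as handle:
        return {line.strip().lstrip(">") for line in handle if line.strip()}


def filter_fasta(lines, exclude):
    """Yield the lines of every record whose ID is NOT in `exclude`.

    Assumption: the ID is the first word of the header line.
    """
    keep = True  # lines before the first header are passed through
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            record_id = words[0] if words else ""
            keep = record_id not in exclude
        if keep:
            yield line  # copied as-is, so '-' gaps are preserved


def main(fasta_path, ids_path, out_path):
    exclude = load_ids(ids_path)
    with open(fasta_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_fasta(src, exclude))


# Runs only when exactly three arguments are given:
#   input fasta, ID list, output fasta
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved under a name of your choice (remove_fasta_records.py here is just a placeholder), it would be run as: python remove_fasta_records.py your_original.fa id_to_exclude final.fa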

0

Thank you! This works well on Linux (couldn't make it run on OSX before and therefore hoped to find a raw/executable script here).

ADD REPLY • link 7.0 years ago by jan • 0

1

Here is a link to the macOS version of the same program, in case someone finds this thread in the future. The rest of the instructions are the same.

ADD REPLY • link 7.0 years ago by GenoMax 141k